问题描述
我的需求如下:-需要从第三方API提取数据到SQL Azure中.
My needs are following:- Need to fetch data from a 3rd party API into SQL azure.
-
每天都会查询API的增量数据,并且可能需要分页,因为默认情况下,任何API响应只会给出前N条记录.
The API's will be queried everyday for incremental data and may require pagination as by default any API response will give only Top N records.
API还需要一个auth令牌才能工作,这是我们开始从端点下载数据之前的第一个调用.
The API also needs an auth token to work, which is the first call before we start downloading data from endpoints.
由于最后两个原因,我选择了每天都会触发的Function App,而不是可以查询Web API的数据工厂.
Due to last two reasons, I've opted for Function App which will be triggered daily rather than data factory which can query web APIs.
是否有更好的方法可以做到这一点?我也在考虑将所有JSON推送到Blob存储中,然后将数据从JSON解析到SQL Azure中.有什么建议吗?
Is there a better way to do this?Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?
推荐答案
要考虑规模问题的一件事是并行化查询和处理.如果没有订购要求,或者处理 all 记录将花费超过10分钟的函数超时时间.或者,如果您想对数据进行一些调整/转换,或者对于不同类型的数据有不同的目的地.或者,如果您想避免发生故障-例如,您的函数在处理过程中途失败,并且您不想重新查询API.或者,您以不同的方式获取数据,并希望在流程中的特定步骤开始处理(而不是从入口点开始运行).种种原因.
One thing to consider for scale would be to parallelize both the query and the processing. If there is no ordering requirement, or if processing all records would take longer than the 10 minute function timeout. Or if you want to do some tweaking/transformation of the data in-flight, or if you have different destinations for different types of data. Or if you want to be insulated from a failure - e.g., your function fails halfway through processing and you don't want to re-query the API. Or you get data a different way and want to start processing at a specific step in the process (rather than running from the entry point). All sorts of reasons.
在此我要说明的是,最佳并行度与复杂度在很大程度上取决于您的舒适度和要求.下面的示例有点极端"的示例,该过程将过程分解为离散的步骤,并对每个步骤使用一个函数.在某些情况下,将特定步骤拆分并将其合并为一个步骤可能没有意义. 持久功能也有助于编排这可能更容易.
I'll caveat here to say that the best degree of parallelism vs complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps and combine them into a single one. Durable Functions also help make orchestration of this potentially easier.
- 由计时器驱动的函数,该函数查询API以了解所需页面的深度,或将其他页面排队到实际上进行页面API调用的第二个函数中
- 然后该函数查询API,并将其写入暂存区域(如Blob)或将每一行放入要写入/处理的队列中(例如,类似存储队列的东西,因为它们既便宜又快速,或者如果有多个方面(例如pub/sub)感兴趣,则将服务总线排队
- 如果要写入暂存blob,则触发blob的函数会读取blob,并将对队列的单个写入操作排队(例如,存储队列,因为对于这样的事情,存储队列既便宜又快速)
- 另一个由队列触发的函数实际上负责处理将各个行写入行中的下一个系统,SQL或其他任何方式.
您将获得一些并行化,并具有从过程中的任何步骤开始并带有正确格式的消息的能力.如果您的处理器遇到不良数据,则诸如中毒队列/死信队列之类的事件将有助于处理异常情况,因此,您可以手动修复不良数据,而不是整个过程快要死掉.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.
这篇关于将来自REST API的增量数据引入SQL Azure的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!