Question
I have a Data Lake with RAW, Staging, and Curated zones. When I "cook" my data, I merge it with the curated files. Some of the files in Curated are extremely large - over 100 GB of data and a billion records - and they are still growing.
To move data from RAW into Curated, I check for records that already exist but need to be "updated". Essentially, I am reprocessing the entire Curated dataset every day, which is very costly. What is the recommended method for getting all the RAW data into the Curated folder in a daily process? Is there a better approach so that I don't have to reprocess all the data every day?
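The merge described above rewrites every curated record daily. A common way to avoid that is to partition the Curated zone by a key such as event date, so a daily run only rewrites the partitions that new RAW records actually touch. Below is a minimal, hypothetical sketch of that idea in Python; all names (`merge_incremental`, `partition_of`, the `event_date` field) are illustrative assumptions, not part of any specific ADLS or U-SQL API.

```python
from collections import defaultdict

def partition_of(record):
    """Derive the partition key (here, an event date) from a record.

    Assumes each record carries an 'event_date' field; in practice the
    partition key should match how the curated files are laid out.
    """
    return record["event_date"]

def merge_incremental(curated, raw_batch):
    """Upsert a batch of RAW records into Curated, touching only affected partitions.

    curated:   dict mapping partition key -> {record id -> record}
    raw_batch: list of new or updated records from the RAW zone
    Returns the set of partition keys that were rewritten.
    """
    # Group the incoming batch by partition so untouched partitions are skipped.
    by_partition = defaultdict(list)
    for rec in raw_batch:
        by_partition[partition_of(rec)].append(rec)

    # Only the partitions present in today's batch are opened and rewritten.
    for pkey, records in by_partition.items():
        partition = curated.setdefault(pkey, {})
        for rec in records:
            partition[rec["id"]] = rec  # insert new record or overwrite ("update")

    return set(by_partition)
```

The key design point is that the daily cost becomes proportional to the size of the day's RAW batch and the partitions it lands in, rather than to the full billion-record curated dataset.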
Answer
There's a great slide deck that covers best practices and performance optimizations for ADLS. Please have a look at this deck.
Let us know if it helps. Otherwise, we can continue to dig in further.