Getting better large-file performance in ADLS and U-SQL


Problem description

I have a Data Lake that has RAW, Staging, and Curated areas. When I am "cooking" my data, I am merging it with the curated files. Some of my curated files are extremely large: over 100 GB of data and a billion records, and growing.


To get the data from RAW into Curated, I am checking for records that already exist but need to be "updated". Essentially, I am reprocessing the Curated data every day, which is very costly. What is the recommended method to get all the RAW data into the Curated folder in a daily process? Is there a better way to process it so that I don't have to reprocess all the data every day?
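One common way to avoid reprocessing the full Curated set is to partition the curated data (for example by event date) and rewrite only the partitions that the day's RAW records actually touch. The sketch below illustrates that idea in Python; the field names `id` and `event_date` are hypothetical stand-ins for your own record key and partition column, and in U-SQL the same pattern would correspond to partitioned tables or date-segmented file sets.

```python
from collections import defaultdict

def partition_of(record):
    # Partition key for a record; here the event date (hypothetical field name).
    return record["event_date"]

def incremental_merge(curated_partitions, raw_records):
    """Upsert raw_records into curated_partitions, rewriting only the
    partitions that the new records touch. Untouched partitions are
    left exactly as they were, so they need not be re-read or rewritten."""
    # Group the day's RAW records by the partition they belong to.
    touched = defaultdict(dict)
    for rec in raw_records:
        touched[partition_of(rec)][rec["id"]] = rec

    # Merge only the touched partitions: existing ids are replaced
    # ("updated"), unseen ids are inserted.
    for part, updates in touched.items():
        merged = {r["id"]: r for r in curated_partitions.get(part, [])}
        merged.update(updates)
        curated_partitions[part] = list(merged.values())
    return curated_partitions

# Example: only the 2024-01-01 partition is rewritten; 2024-01-02 is untouched.
curated = {
    "2024-01-01": [{"id": 1, "event_date": "2024-01-01", "v": "old"}],
    "2024-01-02": [{"id": 2, "event_date": "2024-01-02", "v": "keep"}],
}
raw = [
    {"id": 1, "event_date": "2024-01-01", "v": "new"},   # update
    {"id": 3, "event_date": "2024-01-01", "v": "ins"},   # insert
]
result = incremental_merge(curated, raw)
```

With this layout, the daily job's cost scales with the size of the day's changes rather than with the full 100 GB+ curated set, at the price of keeping the curated data partitioned on a key that the incoming RAW records carry.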

Recommended answer

There's a great slide deck that covers best practices and performance optimizations on ADLS. Please have a look at this deck.


Let us know if it helps. Otherwise, we can continue to investigate further.

