Problem Description
I'm reading a large number (100s to 1000s) of parquet files into a single dask dataframe (single machine, all local). I realized that
import dask.dataframe as dd

files = ['file1.parq', 'file2.parq', ...]
ddf = dd.read_parquet(files, engine='fastparquet')
ddf.groupby(['col_A', 'col_B']).value.sum().compute()
is a lot less efficient than
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

@delayed
def load_chunk(pth):
    return ParquetFile(pth).to_pandas()

ddf = dd.from_delayed([load_chunk(f) for f in files])
ddf.groupby(['col_A', 'col_B']).value.sum().compute()
For my particular application, the second approach (from_delayed) takes 6 seconds to complete, while the first approach takes 39 seconds. In the dd.read_parquet case there seems to be a lot of overhead before the workers even start to do something, and there are quite a few transfer-... operations scattered across the task stream plot. I'd like to understand what's going on here. What could be the reason that the read_parquet approach is so much slower? What does it do differently from just reading the files and putting them in chunks?
Recommended Answer
What you are experiencing is the client trying to establish the min/max statistics of the columns of the data, and thereby establish a good index for the dataframe. An index can be very useful in preventing reads from data files that are not needed for your particular job.
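To make that concrete, here is a minimal sketch (the sorted 'id' index column is a hypothetical example, not something from the question): when the footer statistics are gathered for a sorted column, the resulting dataframe gets known divisions, and index-range selections only have to touch the files whose min/max overlap the requested range.

import dask.dataframe as dd

ddf = dd.read_parquet(files, engine='fastparquet', index='id')
print(ddf.known_divisions)  # True when min/max statistics were collected for the index
# Only the files whose id range overlaps [1_000_000, 2_000_000] are actually read:
subset = ddf.loc[1_000_000:2_000_000].compute()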
In many cases this is a good idea, namely where the amount of data in each file is large and the total number of files is small. In other cases, the same information might be contained in a special "_metadata" file, so that there would be no need to read all the file footers first.
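If the files all sit under one directory, one way to get such a "_metadata" file is to build it once with fastparquet and then point read_parquet at the directory. This is a hedged sketch, not part of the original answer; fastparquet's writer.merge consolidates the individual footers, and 'path/to/parquet_dir' is a placeholder:

from fastparquet import writer
import dask.dataframe as dd

# Consolidate the footers of the existing files into a single _metadata file
# (assumes the files share a schema and a common parent directory).
writer.merge(files)

# Later reads can take the statistics from _metadata instead of opening
# every file footer individually.
ddf = dd.read_parquet('path/to/parquet_dir', engine='fastparquet')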
To prevent the scan of the files' footers, you should call
dd.read_parquet(..., gather_statistics=False)
This should be the default in the next version of dask.
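Applied to the question's workflow, the call would look like the sketch below. The keyword is the one from this answer; later dask releases have reworked the read_parquet options, so check the documentation of your version.

import dask.dataframe as dd

files = ['file1.parq', 'file2.parq', ...]
# Skip the footer scan: no min/max statistics are gathered and no index/divisions
# are built, which is fine because the groupby below does not need them.
ddf = dd.read_parquet(files, engine='fastparquet', gather_statistics=False)
ddf.groupby(['col_A', 'col_B']).value.sum().compute()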