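I am testing the read speed of Parquet files using Dask and Python, and I am finding that reading the same file with pandas is significantly faster than with Dask. I would like to understand why this is, and whether there is a way to get equal performance.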
Versions of all relevant packages:
print(dask.__version__)
print(pd.__version__)
print(pyarrow.__version__)
print(fastparquet.__version__)
2.6.0
0.25.2
0.15.1
0.3.2
import pandas as pd
import numpy as np
import dask.dataframe as dd
col = [str(i) for i in list(np.arange(40))]
df = pd.DataFrame(np.random.randint(0,100,size=(5000000, 4 * 10)), columns=col)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 3.86 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 27.1 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 5.89 s
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 4.84 s
df = pd.read_parquet('large1.parquet',engine='pyarrow')
# Wall time: 503 ms
df = pd.read_parquet('large2.parquet',engine='fastparquet')
# Wall time: 4.12 s
The difference is larger when using a DataFrame with mixed data types.
dtypes: category(7), datetime64[ns](2), float64(1), int64(1), object(9)
memory usage: 973.2+ MB
# df.shape == (8575745, 20)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 9.67 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 33.3 s
# read with Dask
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 34.5 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 1min 22s
# read with pandas
df = pd.read_parquet('large1.parquet',engine='pyarrow')
# Wall time: 8.67 s
df = pd.read_parquet('large2.parquet',engine='fastparquet')
# Wall time: 21.8 s
Best answer
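My first guess is that pandas saves the Parquet dataset into a single row group, which does not allow a system like Dask to parallelize the read. That doesn't explain why Dask is slower, but it does explain why it isn't faster.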
For more information, I would recommend profiling. You may be interested in this document:
https://docs.dask.org/en/latest/understanding-performance.html
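As a starting point for that profiling, here is a minimal sketch using the standard library's cProfile. The helper profile_call and the stand-in workload are hypothetical names introduced for this example. In the real case you would pass something like lambda: dd.read_parquet('large1.parquet', engine='pyarrow').compute(scheduler='synchronous'), since cProfile only records the thread it runs in and the synchronous scheduler keeps the work in that thread.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists up to `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def stand_in_workload(n):
    # Placeholder for the real read, e.g. the dd.read_parquet(...).compute() call.
    return sum(i * i for i in range(n))


result, report = profile_call(stand_in_workload, 100_000)
print(report)
```

Comparing the report for the Dask read against the one for pd.read_parquet should show where the extra time goes.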
Source: the similar question "python - Why is Dask reading a Parquet file so much slower than pandas reading the same Parquet file?" on Stack Overflow: https://stackoverflow.com/questions/58820760/