Why do pandas and dask perform better when importing from CSV than from HDF5?

Question

I am working with a system that currently operates with large (>5GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas VS dask) as well as (B) different ways to store results to disk (.csv VS hdf5 files).

In order to benchmark performance, I did the following:

import gc

import dask.dataframe as dd
import pandas as pd

# Assumes results_path points at the .csv file and store.h5 already exists.
# The %timeit lines are IPython magics, so run this in IPython or Jupyter.

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    # No .compute() is called, so this stays a lazy dask object.
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()

def pandas_read_from_hdf():
    # read_hdf opens and closes the file itself when given a path; the original
    # hdf.close() referred to a store not defined in this snippet and is dropped.
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    # Lazy as well: no .compute() is called.
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()

print("dask hdf performance")
%timeit dask_read_from_hdf()
gc.collect()
print("")
print("pandas hdf performance")
%timeit pandas_read_from_hdf()
gc.collect()
print("")
print("dask csv performance")
%timeit dask_read_from_csv()
gc.collect()
print("")
print("pandas csv performance")
%timeit pandas_read_from_csv()
gc.collect()

My findings are:

dask hdf performance
10 loops, best of 3: 133 ms per loop

pandas hdf performance
1 loop, best of 3: 1.42 s per loop

dask csv performance
1 loop, best of 3: 7.88 ms per loop

pandas csv performance
1 loop, best of 3: 827 ms per loop

When hdf5 storage can be accessed faster than .csv, and when dask creates dataframes faster than pandas, why is dask from hdf5 slower than dask from csv? Am I doing something wrong?

When does it make sense, performance-wise, to create a dask dataframe from an HDF5 store?

Recommended answer

HDF5 is most efficient when working with numerical data. I'm guessing you are reading a single string column, which is its weak point.

Performance of string data with HDF5 can be dramatically improved by storing the strings as a Categorical, assuming relatively low cardinality (a high number of repeated values).
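As a rough illustration of why low cardinality matters, here is a minimal pandas-only sketch. The ticker names and row count are made up for the example, and it compares in-memory size only; it does not write to HDF5.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200,000 rows drawn from only 50 distinct ticker strings.
rng = np.random.default_rng(0)
tickers = np.array([f"TICKER{i:03d}" for i in range(50)], dtype=object)
df = pd.DataFrame({"Security": tickers[rng.integers(0, len(tickers), size=200_000)]})

as_strings = df["Security"]
as_category = df["Security"].astype("category")

# A categorical keeps each distinct string once, plus a small integer code per row.
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"plain strings: {string_bytes:,} bytes")
print(f"categorical:   {category_bytes:,} bytes")

# The distinct values are the same either way.
unique_from_strings = sorted(as_strings.unique())
unique_from_category = sorted(as_category.cat.categories)
```

To store such a column in HDF5, `to_hdf` needs `format='table'` and the PyTables package; that step is left out here so the sketch runs with pandas and NumPy alone.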

It is from a little while back, but this blog post goes through exactly these considerations: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

You may also look at using parquet - it is similar to HDF5 in that it is a binary format, but is column oriented, so a single column selection like this will likely be faster.

Recently (2016-2017) there has been significant work to implement a fast native reader of parquet->pandas, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.
