How to speed up the sum of pandas multilevel DataFrames?

Question

I am trying to speed up the sum of several big multilevel DataFrames.

Here is a sample:

df1 = mul_df(5000,30,400) # mul_df to create a big multilevel dataframe
#let df2, df3, df4 = df1, df1, df1 to minimize the memory usage, 
#they can also be mul_df(5000,30,400) 
df2, df3, df4 = df1, df1, df1

In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop

I am not satisfied with 993 ms. Is there any way to speed this up? Can Cython improve the performance? If so, how would I write the Cython code? Thanks.

Note: mul_df() is the function that creates the demo multilevel DataFrame.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst

Update:

Data on my Pentium Dual-Core T4200 @ 2.00GHz, 3.00GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0, 32-bit):

In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne

In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1

In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop

In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop

In [9]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop

Recommended answer

Method 1: On my machine this is not so bad (with numexpr disabled).

In [41]: from pandas.core import expressions as expr

In [42]: expr.set_use_numexpr(False)

In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop

Method 2: Using numexpr (which is enabled by default if numexpr is installed).

In [44]: expr.set_use_numexpr(True)

In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop

Method 3: Using numexpr directly.

In [34]: import numexpr as ne

In [46]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop

These speedups are achieved using numexpr because it:

  • avoids using intermediate temporary arrays (which, in the case you are presenting, is probably quite inefficient in numpy; I suspect the expression is being evaluated like ((df1+df2)+df3)+df4)
  • uses multiple cores when available
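To make the first point concrete, here is a minimal sketch in plain NumPy. It is not how numexpr works internally; it only contrasts a chained sum, which allocates a new array for every +, with a sum accumulated into one buffer. The function names chained_sum and accumulated_sum are made up for this illustration.

```python
import numpy as np

def chained_sum(arrays):
    """Sum like ((a + b) + c) + d: every + allocates a new temporary array."""
    total = arrays[0]
    for arr in arrays[1:]:
        total = total + arr  # a fresh array is created on each iteration
    return total

def accumulated_sum(arrays):
    """Sum into a single buffer so no intermediate temporaries are created."""
    out = arrays[0].copy()  # one allocation for the result
    for arr in arrays[1:]:
        np.add(out, arr, out=out)  # in-place add, reuses the same buffer
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 40)).astype('float32')
arrays = [a, a, a, a]

# Both strategies add in the same order, so the results agree.
assert np.allclose(chained_sum(arrays), accumulated_sum(arrays))
```

numexpr goes further than this sketch: it evaluates the whole expression chunk by chunk and can spread the chunks over several threads.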

As I hinted above, pandas uses numexpr under the hood for certain types of operations (as of 0.11); for example, df1 + df2 would be evaluated this way. However, the example you are giving here results in several separate calls to numexpr (this is why method 2 is faster than method 1). Calling ne.evaluate(...) directly (method 3) achieves an even bigger speedup.
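As a rough sketch of the direct approach, the underlying arrays can be passed to numexpr explicitly rather than letting it look up the DataFrames by name. The helper name sum_four is hypothetical, and the NumPy fallback is only there so the snippet still runs when numexpr is not installed; it assumes the four frames already share the same index and columns.

```python
import numpy as np
import pandas as pd

try:
    import numexpr as ne
except ImportError:  # numexpr is optional here; fall back to plain NumPy
    ne = None

def sum_four(df1, df2, df3, df4):
    """Sum four already-aligned frames in one numexpr call (hypothetical helper)."""
    a, b, c, d = (df.to_numpy() for df in (df1, df2, df3, df4))
    if ne is not None:
        # A single call evaluates the whole expression in one pass.
        values = ne.evaluate('a + b + c + d',
                             local_dict={'a': a, 'b': b, 'c': c, 'd': d})
    else:
        values = a + b + c + d
    # numexpr returns a bare array, so the labels are reattached by hand.
    return pd.DataFrame(values, index=df1.index, columns=df1.columns)

index = pd.MultiIndex.from_product([['A0000', 'A0001'], ['B000', 'B001']],
                                   names=['STK_ID', 'RPT_Date'])
df = pd.DataFrame(np.random.randn(4, 3).astype('float32'),
                  index=index, columns=['COL000', 'COL001', 'COL002'])
result = sum_four(df, df, df, df)
```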

Note that in pandas 0.13 (0.12 will be released this week), we are implementing a function pd.eval, which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master fairly soon: https://github.com/pydata/pandas/pull/4037)

In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop
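For reference, a small self-contained sketch of the pd.eval call on a toy frame, assuming a pandas version that ships pd.eval. The tiny frame and its labels are invented for the example; pd.eval picks up df1 to df4 from the calling scope and returns a DataFrame with the labels intact.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for mul_df(): two-level row index, float32 data.
index = pd.MultiIndex.from_product([['A0000', 'A0001'], ['B000', 'B001', 'B002']],
                                   names=['STK_ID', 'RPT_Date'])
df1 = pd.DataFrame(np.random.randn(6, 4).astype('float32'),
                   index=index,
                   columns=['COL000', 'COL001', 'COL002', 'COL003'])
df2, df3, df4 = df1, df1, df1

# The whole expression is handed over as one string.
result = pd.eval('df1 + df2 + df3 + df4')

# Same values as the ordinary chained sum.
expected = df1 + df2 + df3 + df4
assert np.allclose(result.to_numpy(), expected.to_numpy())
```

On frames this small the call overhead dominates, so any speedup only shows up on large inputs like the one in the question.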

Lastly, to answer your question: Cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where Cython is helpful).

One caveat: in order to use the direct numexpr method, the frames should already be aligned (numexpr operates on the numpy arrays and doesn't know anything about the indices). They should also be of a single dtype.
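One way to guard against this caveat is to check alignment and dtype before dropping down to the raw arrays. The sketch below is an assumption about how such a guard might look, not part of pandas or numexpr: the helper name sum_aligned is hypothetical, and it accumulates with np.add for portability, though the same checks would apply before a call to ne.evaluate.

```python
import numpy as np
import pandas as pd

def sum_aligned(frames):
    """Sum frames on their raw arrays after verifying labels and dtype (hypothetical helper)."""
    first = frames[0]
    for df in frames[1:]:
        # Raw-array arithmetic ignores labels, so they must match exactly.
        if not (df.index.equals(first.index) and df.columns.equals(first.columns)):
            raise ValueError('frames are not aligned; reindex them first')
    dtypes = {dtype for df in frames for dtype in df.dtypes}
    if len(dtypes) != 1:
        raise ValueError('frames must share a single dtype, got %s' % sorted(map(str, dtypes)))
    total = first.to_numpy(copy=True)  # copy so the input frame is not modified
    for df in frames[1:]:
        np.add(total, df.to_numpy(), out=total)
    return pd.DataFrame(total, index=first.index, columns=first.columns)

index = pd.MultiIndex.from_product([['A0000', 'A0001'], ['B000', 'B001']],
                                   names=['STK_ID', 'RPT_Date'])
df = pd.DataFrame(np.ones((4, 2), dtype='float32'),
                  index=index, columns=['COL000', 'COL001'])
total = sum_aligned([df, df, df, df])
```

If the check fails, aligning the frames with reindex (or simply using the ordinary + operator, which aligns for you) is the safer route.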
