Problem description
I was using the pandas groupby mean function like the following on a very large dataset:
import pandas as pd

df = pd.read_csv("large_dataset.csv")
df.groupby(['variable']).mean()
It looks like the function is not using multiprocessing, so I implemented a parallelized version:
import pandas as pd
from multiprocessing import Pool, cpu_count

def meanFunc(tmp_name, df_input):
    # Compute the mean of a single group and return it as a one-row DataFrame
    df_res = df_input.mean().to_frame().transpose()
    return df_res

def applyParallel(dfGrouped, func):
    num_process = int(cpu_count())
    with Pool(num_process) as p:
        # Materialize every (name, group) pair and send each one to a worker process
        ret_list = p.starmap(func, [[name, group] for name, group in dfGrouped])
    return pd.concat(ret_list)

applyParallel(df.groupby(['variable']), meanFunc)
However, it seems that the pandas implementation is still much faster than my parallel implementation.
I am looking at the source code for pandas groupby, and I see that it is using Cython. Is that the reason?
def _cython_agg_general(self, how, alt=None, numeric_only=True,
                        min_count=-1):
    output = {}
    for name, obj in self._iterate_slices():
        is_numeric = is_numeric_dtype(obj.dtype)
        if numeric_only and not is_numeric:
            continue

        try:
            result, names = self.grouper.aggregate(obj.values, how,
                                                   min_count=min_count)
        except AssertionError as e:
            raise GroupByError(str(e))
        output[name] = self._try_cast(result, obj)

    if len(output) == 0:
        raise DataError('No numeric types to aggregate')

    return self._wrap_aggregated_output(output, names)
Recommended answer
Short answer - use dask if you want parallelism for this type of case. Your approach has pitfalls that it avoids. It still might not be faster, but it will give you the best shot, and it is largely a drop-in replacement for pandas.
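For reference, a minimal sketch of what the dask version might look like, assuming dask is installed and reusing the same hypothetical CSV file; dask splits the frame into partitions and can aggregate them in parallel:

import dask.dataframe as dd

# Read the same (hypothetical) CSV lazily; dask partitions it instead of
# loading one big in-memory frame.
ddf = dd.read_csv("large_dataset.csv")

# Same groupby API as pandas; .compute() triggers the actual parallel work.
result = ddf.groupby(['variable']).mean().compute()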
Longer answer
1) Parallelism inherently adds overhead, so ideally the operation you're parallelizing should be somewhat expensive. Adding up numbers isn't especially expensive - you're right that Cython is used here, but the code you're looking at is only the dispatch logic. The actual core Cython is here, and it translates down to a very simple C loop.
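To get a feel for how cheap the per-group work is, here is an illustrative micro-benchmark (the data shape is an assumption, not your dataset, and timings will vary by machine):

import timeit

import numpy as np
import pandas as pd

# One million rows spread over 100 groups - an assumed stand-in for the
# real dataset.
df = pd.DataFrame({
    'variable': np.random.randint(0, 100, 1_000_000),
    'value': np.random.rand(1_000_000),
})

# The grouped mean is a tight C loop over the data, so the work being
# parallelized is cheap compared to process startup and pickling overhead.
print(timeit.timeit(lambda: df.groupby(['variable']).mean(), number=10))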
2) You're using multiprocessing - which means that each process needs to take a copy of the data. This is expensive. Normally you have to do this in Python because of the GIL - but here you actually can (and dask does) use threads, because the pandas operation is implemented in C and releases the GIL.
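As a sketch of that idea, the applyParallel above could swap the process pool for a thread pool (applyParallelThreads is a hypothetical name, not a pandas or dask API):

from multiprocessing.pool import ThreadPool

import pandas as pd

def applyParallelThreads(dfGrouped, func):
    # Threads share memory, so the groups are not pickled and copied to
    # worker processes; this only pays off because pandas releases the GIL
    # inside its C aggregation code.
    with ThreadPool() as p:
        ret_list = p.starmap(func, [(name, group) for name, group in dfGrouped])
    return pd.concat(ret_list)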
3) As @AKX noted in the comments - the iteration you do before parallelizing (... name, group in dfGrouped) is also relatively expensive - it constructs a new sub-DataFrame for each group. The original pandas algorithm iterates over the data in place.
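A rough way to see this cost (an illustrative sketch with made-up data; timings will vary) is to time just the group materialization against the built-in mean:

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'variable': np.random.randint(0, 1000, 1_000_000),
    'value': np.random.rand(1_000_000),
})
grouped = df.groupby(['variable'])

# Building the per-group sub-frames alone (no math at all) can already take
# as long as the entire built-in mean, which never constructs them.
print(timeit.timeit(lambda: [g for _, g in grouped], number=5))
print(timeit.timeit(lambda: df.groupby(['variable']).mean(), number=5))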