This article covers how to parallelize apply after a pandas groupby.
Problem description
I have used rosetta.parallel.pandas_easy to parallelize apply after groupby, for example:
import numpy as np
import pandas as pd
from rosetta.parallel.pandas_easy import groupby_to_series_to_frame

df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)
However, has anyone figured out how to parallelize a function that returns a DataFrame? As expected, this code fails with rosetta.
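def tmpFunc(df):
    # Add a column holding the row-wise sum of a and b.
    df['c'] = df.a + df.b
    return df

# The ordinary, non-parallel groupby/apply call:
df.groupby(df.index).apply(tmpFunc)
# The rosetta helper fails when given the same function:
groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)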
Recommended answer
This seems to work, although it really should be built into pandas:
import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def tmpFunc(df):
    # Add a column holding the row-wise sum of a and b.
    df['c'] = df.a + df.b
    return df

def applyParallel(dfGrouped, func):
    # Run func on each group in a joblib worker, then concatenate the results.
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print('parallel version: ')
    print(applyParallel(df.groupby(df.index), tmpFunc))
    print('regular version: ')
    print(df.groupby(df.index).apply(tmpFunc))
    print('ideal version (does not work): ')
    # Raises AttributeError: groupby objects have no applyParallel method.
    print(df.groupby(df.index).applyParallel(tmpFunc))
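The helper in the answer could be generalized a little. The sketch below is one possible variant, not part of the original answer or of the pandas API: it takes the number of workers as a parameter and passes the group names to pd.concat as keys, so the result records which group each row came from. The names apply_parallel and add_sum and the default n_jobs=2 are assumptions made for illustration.

```python
import pandas as pd
from joblib import Parallel, delayed


def add_sum(group):
    # Work on a copy so the frame passed in is never modified in place.
    group = group.copy()
    group['c'] = group['a'] + group['b']
    return group


def apply_parallel(grouped, func, n_jobs=2):
    # Hypothetical helper: run func on each group through joblib,
    # then concatenate the pieces, labelled by their group name.
    # Assumes the groupby object contains at least one group.
    names, groups = zip(*grouped)
    results = Parallel(n_jobs=n_jobs)(delayed(func)(g) for g in groups)
    return pd.concat(results, keys=names)


if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]},
                      index=['g1', 'g1', 'g2'])
    print(apply_parallel(df.groupby(df.index), add_sum))
```

With n_jobs=1 joblib runs the calls sequentially in the current process, which makes it easier to debug func before spreading the work over several workers. For groups as small as the ones in this example, the cost of starting workers and shipping data to them will likely outweigh any gain, so a parallel apply mainly pays off when func is expensive per group.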