I am looking for the aggregation below, ideally in one step. The aggregated columns need to be computed with different filters, and I came up with two ways to do this (see functions f1 and f2). I thought that defining the indices first (as in f2) would speed things up, but the opposite is true: the aggregation takes about 2-3 times as long, regardless of the number of rows in the dataframe. Why is that? I thought .loc was the recommended approach. Also, is there a third method, faster than f1? I am using Python 3.6.4.
import numpy as np
import pandas as pd
from collections import OrderedDict
import time
N = 10**5
df_big = pd.DataFrame({'grp': np.array(list(range(1, 11)) * N),
                       'vals': np.random.randint(0, 100, 10 * N),
                       'var1': np.random.randint(10, 30, 10 * N)})
def f1(x):
    d = OrderedDict()
    d['vals_sum_1'] = np.sum(x['vals'][x['var1'] > 15])
    d['vals_mean_1'] = np.mean(x['vals'][x['var1'] > 15])
    d['vals_median_1'] = np.median(x['vals'][x['var1'] > 15])
    d['vals_sum_2'] = np.sum(x['vals'][x['var1'] > 20])
    d['vals_mean_2'] = np.mean(x['vals'][x['var1'] > 20])
    d['vals_median_2'] = np.median(x['vals'][x['var1'] > 20])
    return pd.Series(d)
def f2(x):
    d = OrderedDict()
    idx1 = x.loc[x['var1'] > 15].index
    idx2 = x.loc[x['var1'] > 20].index
    d['vals_sum_1'] = np.sum(x['vals'][idx1])
    d['vals_mean_1'] = np.mean(x['vals'][idx1])
    d['vals_median_1'] = np.median(x['vals'][idx1])
    d['vals_sum_2'] = np.sum(x['vals'][idx2])
    d['vals_mean_2'] = np.mean(x['vals'][idx2])
    d['vals_median_2'] = np.median(x['vals'][idx2])
    return pd.Series(d)
start_time = time.time()
df_grp_1 = df_big.groupby('grp').apply(f1).reset_index()
gr1_time = time.time()
df_grp_2 = df_big.groupby('grp').apply(f2).reset_index()
gr2_time = time.time()
print("Using aggf1: %s seconds ---" % (gr1_time - start_time))
print("Using aggf2: %s seconds ---" % (gr2_time - gr1_time))
Best Answer
There are a lot of repeated operations here. By removing the duplicated indexing you gain roughly a factor of 2:
def f3(df):
    g1 = df.loc[df['var1'] > 15].groupby('grp')['vals']
    g2 = df.loc[df['var1'] > 20].groupby('grp')['vals']
    res = pd.DataFrame({'grp': df['grp'].unique()})
    for i, j in enumerate([g1, g2], 1):
        res['vals_sum_' + str(i)] = res['grp'].map(j.sum())
        res['vals_mean_' + str(i)] = res['grp'].map(j.mean())
        res['vals_median_' + str(i)] = res['grp'].map(j.median())
    return res
%timeit df_big.groupby('grp').apply(f1).reset_index() # 349ms
%timeit df_big.groupby('grp').apply(f2).reset_index() # 433ms
%timeit f3(df_big) # 183ms
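As for the original "why" question: f2 is most likely slower because x['vals'][idx1] selects by index labels, which pandas resolves through per-label hash lookups and alignment, whereas the boolean mask in f1 is a single vectorized comparison; the two extra .loc slices also materialize an intermediate DataFrame for every group. The same deduplication idea as in f3 can also be applied inside the per-group function itself: compute each mask once and index plain NumPy arrays. A minimal sketch (f1_cached is a hypothetical name, not timed here):

def f1_cached(x):
    # Build each boolean mask once per group instead of three times,
    # and drop to NumPy arrays to bypass pandas' label machinery.
    vals = x['vals'].values
    m1 = (x['var1'] > 15).values
    m2 = (x['var1'] > 20).values
    d = OrderedDict()
    d['vals_sum_1'] = vals[m1].sum()
    d['vals_mean_1'] = vals[m1].mean()
    d['vals_median_1'] = np.median(vals[m1])
    d['vals_sum_2'] = vals[m2].sum()
    d['vals_mean_2'] = vals[m2].mean()
    d['vals_median_2'] = np.median(vals[m2])
    return pd.Series(d)

df_grp_1b = df_big.groupby('grp').apply(f1_cached).reset_index()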
A similar question on Stack Overflow: python - Optimization - dataframe aggregation with different filters during aggregation (df.loc?): https://stackoverflow.com/questions/49236188/
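One more variant worth sketching, not from the original thread: the two groupby calls in f3 can be fused into a single pass by NaN-masking vals per threshold (sum, mean and median skip NaN by default) and aggregating both masked columns at once. Whether this beats f3 depends on the data, since it materializes two extra float columns; f4 is a hypothetical name:

def f4(df):
    # NaN out the rows that fail each filter; the aggregations ignore NaN.
    tmp = pd.DataFrame({'grp': df['grp'],
                        'vals_1': df['vals'].where(df['var1'] > 15),
                        'vals_2': df['vals'].where(df['var1'] > 20)})
    res = tmp.groupby('grp').agg(['sum', 'mean', 'median'])
    # Flatten ('vals_1', 'sum') -> 'vals_sum_1' to match the earlier layout.
    res.columns = ['{}_{}_{}'.format(c.split('_')[0], stat, c.split('_')[1])
                   for c, stat in res.columns]
    return res.reset_index()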