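I have a DataFrame that contains a group ID, two distance measures (longitude/latitude-type measures), and a value. For a given set of distances, I want to find the number of other groups nearby and the average value of those nearby groups.
I wrote the code below, but it is so inefficient that it does not finish in a reasonable time on very large datasets. Counting the nearby retailers is fast, but computing the average value of the nearby retailers is extremely slow. Is there a better way to make this more efficient?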
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

distances = [1, 2]
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])

# get one row per group, with the two distances for each row
df_groups = df.groupby('Group')[['Dist1', 'Dist2']].mean()

# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1', 'Dist2']])

# find points within a given radius
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)

    # put into density column (each neighbour list also contains the group itself)
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]

    # get average values of nearby groups
    for idx, val in enumerate(df_groups.index):
        val_idx = df_groups.iloc[closeby[idx]].index.values
        mean = df.loc[df['Group'].isin(val_idx), 'Value'].mean()
        df_groups.loc[val, str(i) + '_mean_values'] = mean

    # merge back to dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
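Best answer
The problem is clearly the use of the isin method to index the main DataFrame: as the DataFrame grows, every lookup has to search more rows. I suggest running the same search on the much smaller df_groups DataFrame instead, and computing an updated (combined) mean from the per-group means and counts.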
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])
distances = [1, 2]

# get the mean of each distance, plus the mean and row count of Value, per group
df_groups = df.groupby('Group')[['Dist1', 'Dist2', 'Value']].agg(
    {'Dist1': 'mean', 'Dist2': 'mean', 'Value': ['mean', 'count']})

# flatten the two-level column index, e.g. ('Value', 'mean') -> 'Value mean'
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

# rename columns
df_groups.rename(columns={'Dist1 mean': 'Dist1',
                          'Dist2 mean': 'Dist2',
                          'Value mean': 'Value',
                          'Value count': 'Count'},
                 inplace=True)

# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1', 'Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)

    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]

    # create column holding the group IDs of each group's neighbours
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]

    # per-group sum of Value (mean * count), used for the updated mean
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']

    # perform updated mean: sum of neighbour totals / sum of neighbour counts
    df_groups[str(i) + '_mean_values'] = [
        (df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
         df_groups.loc[df_groups.index.isin(row), 'Count'].sum())
        for row in df_groups['subs']]

    # merge back to dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)
The formula for the combined mean of two groups is (m1 * n1 + m2 * n2) / (n1 + n2), where m is a group's mean and n is its row count.
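As a quick sanity check of that formula, the sketch below compares the count-weighted combination of two group means with the mean taken directly over the pooled rows. The sample values and variable names are made up for illustration.

```python
import numpy as np

# Two hypothetical groups of raw values (illustrative data only)
g1 = np.array([10.0, 20.0, 30.0])   # m1 = 20, n1 = 3
g2 = np.array([40.0, 60.0])         # m2 = 50, n2 = 2

m1, n1 = g1.mean(), len(g1)
m2, n2 = g2.mean(), len(g2)

# Combined mean from the per-group means and counts
combined = (m1 * n1 + m2 * n2) / (n1 + n2)

# Mean computed directly over all rows of both groups
pooled = np.concatenate([g1, g2]).mean()

print(combined, pooled)
```

Both numbers agree, which is why the answer only needs the per-group mean and count rather than the full DataFrame.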
Old setup (the question's code, wrapped as old(df))
100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop
1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop
10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop
New setup (the code above, wrapped as new(df))
100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop
1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop
10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
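The new setup still calls isin once per group inside a list comprehension. If that loop ever becomes the bottleneck when there are many groups, one possible alternative (not part of the answer above, and not timed here) is to flatten the neighbour lists and let np.bincount do the summation. This is a minimal sketch, assuming closeby comes from query_ball_tree as above; the function name neighbour_means and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def neighbour_means(closeby, group_mean, group_count):
    """Count-weighted mean of Value over each group's neighbours.

    closeby     : list of lists of positional indices (from query_ball_tree)
    group_mean  : 1-D array, mean Value per group
    group_count : 1-D array, number of rows per group
    """
    n = len(closeby)
    lengths = np.fromiter((len(c) for c in closeby), dtype=np.intp, count=n)
    # row id repeated once per neighbour, plus the flattened neighbour ids
    rows = np.repeat(np.arange(n), lengths)
    cols = np.fromiter((j for c in closeby for j in c), dtype=np.intp,
                       count=int(lengths.sum()))
    weighted = group_mean * group_count          # m_k * n_k for every group
    num = np.bincount(rows, weights=weighted[cols], minlength=n)
    den = np.bincount(rows, weights=group_count[cols], minlength=n)
    return num / den


# Illustrative data in the same shape as the example above
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])
df_groups = df.groupby('Group').agg(Dist1=('Dist1', 'mean'),
                                    Dist2=('Dist2', 'mean'),
                                    Value=('Value', 'mean'),
                                    Count=('Value', 'count'))
tree = cKDTree(df_groups[['Dist1', 'Dist2']].to_numpy())
closeby = tree.query_ball_tree(tree, r=2)
fast = neighbour_means(closeby,
                       df_groups['Value'].to_numpy(),
                       df_groups['Count'].to_numpy())
```

Because every group is its own neighbour (distance 0), the denominator is never zero here. Whether this actually beats the isin version depends on the number of groups and the average neighbourhood size, so it is worth timing on real data before adopting it.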