Cumulative percentage for a multi-index DataFrame in pandas
Problem description
I want to compute a cumulative percentage for a multi-index DataFrame in pandas, but I cannot get it to work.
import pandas as pd
to_df = {'domain': {(12, 12): 2, (14, 14): 1, (15, 15): 2, (15, 17): 2, (17, 17): 1},
         'time': {(12, 12): 1, (14, 14): 1, (15, 15): 2, (15, 17): 1, (17, 17): 1},
         'weight': {(12, 12): 3, (14, 14): 4, (15, 15): 1, (15, 17): 2, (17, 17): 5}}
df = pd.DataFrame.from_dict(to_df)
       domain  time  weight
12 12       2     1       3
14 14       1     1       4
15 15       2     2       1
   17       2     1       2
17 17       1     1       5
df = df.groupby(['time', 'domain']).apply(
pd.DataFrame.sort_values, 'weight', ascending=True)
cumsum() works as intended:
df["cum_sum_time_domain"] = df.groupby(['time', 'domain'])['weight'].cumsum()
                   domain  time  weight  cum_sum_time_domain
time domain
1    1      14 14       1     1       4                    4
            17 17       1     1       5                    9
     2      15 17       2     1       2                    2
            12 12       2     1       3                    5
2    2      15 15       2     2       1                    1
Running the aggregation on its own does work:
df.groupby(['time', 'domain']).weight.sum()
df.groupby(['time', 'domain'])['weight'].sum()
However, both assignments suddenly yield NaN in every row:
df["sum_time_domain"] = df.groupby(['time', 'domain']).weight.sum()
df
df["sum_time_domain"] = df.groupby(['time', 'domain'])['weight'].sum()
df
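The NaN values are most likely an index-alignment effect: `sum()` returns one row per group, indexed by `(time, domain)`, whereas the DataFrame is indexed by its original keys, so no label matches when the result is assigned as a column. A minimal sketch of that mismatch, assuming the question's data rebuilt with an explicit `MultiIndex` and before the `groupby(...).apply(...)` sort (the names `group_sums` and `aligned` are illustrative, not from the original post):

```python
import pandas as pd

# Same data as in the question, built with an explicit two-level index.
index = pd.MultiIndex.from_tuples([(12, 12), (14, 14), (15, 15), (15, 17), (17, 17)])
df = pd.DataFrame(
    {"domain": [2, 1, 2, 2, 1], "time": [1, 1, 2, 1, 1], "weight": [3, 4, 1, 2, 5]},
    index=index,
)

# One row per (time, domain) group, so the result is indexed by the group keys.
group_sums = df.groupby(["time", "domain"])["weight"].sum()
print(group_sums.index.tolist())  # group keys, not the row labels of df

# Column assignment aligns on the index; reindexing makes that step explicit.
# None of the group keys appear among df's row labels, so every value is missing.
aligned = group_sums.reindex(df.index)
print(aligned)
```

Because the aggregated result has three rows and the frame has five, some form of broadcasting back to the original rows is needed before the values can be stored as a column.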
Combining the two raises an error: 'merging with more than one level overlap on a multi-index is not implemented'
df["cum_perc_time_domain"] = 100 * df.groupby(['time', 'domain'])['weight'].cumsum() / df.groupby(
['time', 'domain'])['weight'].sum()
Recommended answer
I think you need transform with sum, which returns the group total for every original row. Also, groupby is not necessary for sorting; use only sort_values:
df = df.sort_values(['time','domain','weight'])
print (df.groupby(['time', 'domain']).weight.transform('sum'))
14  14    9
17  17    9
15  17    5
12  12    5
15  15    1
Name: weight, dtype: int64
# Parentheses allow the expression to continue onto the next line.
df["cum_perc_time_domain"] = (100 * df.groupby(['time', 'domain'])['weight'].cumsum() /
                              df.groupby(['time', 'domain']).weight.transform('sum'))
print (df)
       domain  time  weight  cum_perc_time_domain
14 14       1     1       4             44.444444
17 17       1     1       5            100.000000
15 17       2     1       2             40.000000
12 12       2     1       3            100.000000
15 15       2     2       1            100.000000
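For reuse, the same steps can be wrapped in a small function. This is only a sketch: `add_cumulative_percentage` and its parameters `keys`, `value` and `out` are hypothetical names introduced here, not part of pandas or of the original answer.

```python
import pandas as pd

def add_cumulative_percentage(df, keys, value, out="cum_perc"):
    """Hypothetical helper: cumulative share of `value` within each `keys` group."""
    # Sort by group keys, then by value, so the running sum goes small to large.
    ordered = df.sort_values(keys + [value])
    grouped = ordered.groupby(keys)[value]
    # cumsum() and transform("sum") both keep the row index, so they align.
    ordered[out] = 100 * grouped.cumsum() / grouped.transform("sum")
    return ordered

# Same data as in the question, built with an explicit two-level index.
index = pd.MultiIndex.from_tuples([(12, 12), (14, 14), (15, 15), (15, 17), (17, 17)])
df = pd.DataFrame(
    {"domain": [2, 1, 2, 2, 1], "time": [1, 1, 2, 1, 1], "weight": [3, 4, 1, 2, 5]},
    index=index,
)

result = add_cumulative_percentage(df, ["time", "domain"], "weight")
print(result["cum_perc"].round(2).tolist())
# [44.44, 100.0, 40.0, 100.0, 100.0]
```

Since `sort_values` returns a new frame, the input DataFrame is left unchanged and the caller decides whether to keep the sorted copy.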