I have a pandas DataFrame in which the first four columns form a MultiIndex, for example:

import pandas as pd
data = [[1, 'A', 1, 0, 10],
        [1, 'A', 0, 1, 10],
        [1, 'A', 1, 1, 10],
        [1, 'A', 0, 0, 10],
        [1, 'B', 1, 0, 10],
        [1, 'B', 0, 1, 10],
        [1, 'B', 1, 1, 10],
        [1, 'B', 0, 0, 10]]
cols = ['user_id','type','flag1','flag2','cnt']
df = pd.DataFrame(data,columns = cols)
df = df.set_index(['user_id','type','flag1','flag2'])
print(df)

user_id    type    flag1    flag2    cnt
________________________________________
1          A       1        0        10
1          A       0        1        10
1          A       1        1        10
1          A       0        0        10
1          B       1        0        10
1          B       0        1        10
1          B       1        1        10
1          B       0        0        10


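I want to iterate over the index levels and get the grouped total of cnt for every unique combination of index values, like this: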

user_id    type    flag1    flag2    cnt
________________________________________
1          ALL     ALL      ALL      80
1          ALL     ALL      0        40
1          ALL     ALL      1        40
1          ALL     1        ALL      40
1          ALL     0        ALL      40
1          A       ALL      ALL      40
1          B       ALL      ALL      40
1          A       ALL      0        20
1          A       ALL      1        20
1          B       ALL      0        20
1          B       ALL      1        20
1          A       1        ALL      20
1          A       0        ALL      20
1          B       1        ALL      20
1          B       0        ALL      20
1          A       1        0        10
1          A       0        1        10
1          A       1        1        10
1          A       0        0        10
1          B       1        0        10
1          B       0        1        10
1          B       1        1        10
1          B       0        0        10


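I can easily generate each group with query and groupby, but ideally I would like to iterate over an arbitrary number of index columns and get the sum of the cnt column for each.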
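For reference, a single one of these subtotals might be produced as in the minimal sketch below. It assumes the same sample data as above and picks the (user_id, type) pair of levels purely as an illustration; the name by_type is hypothetical.

```python
import pandas as pd

data = [[1, 'A', 1, 0, 10],
        [1, 'A', 0, 1, 10],
        [1, 'A', 1, 1, 10],
        [1, 'A', 0, 0, 10],
        [1, 'B', 1, 0, 10],
        [1, 'B', 0, 1, 10],
        [1, 'B', 1, 1, 10],
        [1, 'B', 0, 0, 10]]
cols = ['user_id', 'type', 'flag1', 'flag2', 'cnt']
df = pd.DataFrame(data, columns=cols).set_index(cols[:-1])

# Sum cnt over one hand-picked subset of index levels.
# Every other subset would need its own call like this one,
# which is the repetition the question is trying to avoid.
by_type = df.groupby(level=['user_id', 'type'])['cnt'].sum()
print(by_type)
```

Repeating this by hand for every subset of levels is what becomes unwieldy as the number of index columns grows.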

Best answer

Similar to the previous answers, here is a more streamlined approach using itertools and groupby:

from itertools import chain, combinations

indices = ['user_id','type','flag1','flag2']
# every non-empty combination of index levels (the powerset minus the empty set)
powerset = list(chain.from_iterable(combinations(indices, r) for r in range(1,len(indices)+1)))

flat = df.reset_index()
# sum cnt for each combination; levels that were not grouped on are labelled "ALL"
master = (pd.concat([flat.groupby(list(p), as_index=False)["cnt"].sum()
                         .assign(**{c: "ALL" for c in indices if c not in p})
                     for p in powerset if p[0] == "user_id"])[cols]
            .sort_values("cnt", ascending=False, kind="mergesort"))  # stable sort: ties keep concat order
print(master)


Output:

   user_id type flag1 flag2  cnt
0        1  ALL   ALL   ALL   80
0        1    A   ALL   ALL   40
1        1    B   ALL   ALL   40
0        1  ALL     0   ALL   40
1        1  ALL     1   ALL   40
0        1  ALL   ALL     0   40
1        1  ALL   ALL     1   40
0        1    A     0   ALL   20
1        1    A     1   ALL   20
2        1    B     0   ALL   20
3        1    B     1   ALL   20
0        1    A   ALL     0   20
1        1    A   ALL     1   20
2        1    B   ALL     0   20
3        1    B   ALL     1   20
0        1  ALL     0     0   20
1        1  ALL     0     1   20
2        1  ALL     1     0   20
3        1  ALL     1     1   20
0        1    A     0     0   10
1        1    A     0     1   10
2        1    A     1     0   10
3        1    A     1     1   10
4        1    B     0     0   10
5        1    B     0     1   10
6        1    B     1     0   10
7        1    B     1     1   10


The powerset computation comes straight from the itertools documentation.
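As a rough illustration of that recipe, the sketch below enumerates the level combinations the answer groups on. The helper name nonempty_powerset and the variable groupings are hypothetical; unlike the recipe in the itertools documentation, this variant starts at r=1 and so skips the empty tuple, matching the range used in the answer.

```python
from itertools import chain, combinations

def nonempty_powerset(items):
    """Yield every non-empty combination of items, shortest first."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )

indices = ['user_id', 'type', 'flag1', 'flag2']

# Keep only the combinations that start with user_id, as the answer does,
# so that every subtotal is still broken down per user.
groupings = [p for p in nonempty_powerset(indices) if p[0] == 'user_id']
for g in groupings:
    print(g)
```

With four index levels there are 15 non-empty combinations, of which 8 include user_id; each of those 8 becomes one groupby call.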

Regarding python - iterating over MultiIndex levels in pandas and grouping to get totals, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44080517/
