python - 通过将列分组与 Pandas 来汇总数据框

我有一个数据框

id  store  val1    val2
1    abc    20      30
1    abc    20      40
1    qwe    78      45
2    dfd    34      45
2    sad    43      45

因此，我必须对id进行分组，并使用列total_store和unique stores和non-repeating_stores创建一个新的df，其中包含此类存储发生的次数。
我的最终输出应该是

id    total_store    unique stores    non-repeating_stores
1        3              2                   1
2        2              2                   2

我可以通过这样做来获得全部商店

df.groupby('id')['store'].count()

但是我如何得到别人并形成一个数据框

最佳答案

您可以使用groupby + agg。

df = df.groupby('id').store.agg(['count', 'nunique', \
                lambda x: x.drop_duplicates(keep=False).size])
df.columns = ['total_store', 'unique stores', 'non-repeating_stores']

df
    total_store  unique stores  non-repeating_stores
id
1             3              2                     1
2             2              2                     2

对于较早的熊猫版本，通过传递dict可以简化代码（在0.20及以后版本中不推荐使用）：

agg_funcs = {'total_stores' : 'count', 'unique_stores' : 'nunique',
         'non-repeating_stores' : lambda x: x.drop_duplicates(keep=False).size
}
df = df.groupby('id').store.agg(agg_funcs)

df
    total_stores  non-repeating_stores  unique_stores
id
1              3                     1              2
2              2                     2              2

作为对速度的略微改进，您可以采用as documented by jezrael的方式使用drop_duplicates的姐妹方法duplicated：

lambda x: (~x.duplicated(keep=False)).sum()

这将替换agg中的第三个功能，使大小为1000000的大数据的速度提高20％：

1 loop, best of 3: 7.31 s per loop

伏/秒

1 loop, best of 3: 5.19 s per loop

关于python - 通过将列分组与 Pandas 来汇总数据框，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/46464718/