python - Grouping rows of a Pandas Series or DataFrame when a row can belong to multiple groups

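The groupby method of pandas DataFrame / Series objects is very useful when each item/row belongs to exactly one group. In my case, however, each row can belong to zero, one, or several groups.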

An example with some hypothetical data:

+--------+-------+----------------------+
| Item   | Count | Tags                 |
+--------+-------+----------------------+
| Apple  |     5 | ['fruit', 'red']     |
| Tomato |    10 | ['vegetable', 'red'] |
| Potato |     3 | []                   |
| Orange |    20 | ['fruit']            |
+--------+-------+----------------------+

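According to the Tags column, Apple and Tomato each belong to two groups, Potato belongs to no group, and Orange belongs to one group. Grouping by tag and summing the counts for each tag should therefore give: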

+-----------+-------+
| Tag       | Count |
+-----------+-------+
| fruit     |    25 |
| red       |    15 |
| vegetable |    10 |
+-----------+-------+

How can I do this?
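
For reference, the sample table can be built as a DataFrame roughly as follows. This is a minimal sketch that assumes the Tags column stores plain Python lists (rather than, say, comma-separated strings); the variable name `df` is chosen to match the answers.

```python
import pandas as pd

# Sample data from the table above; Tags is assumed to hold Python lists.
df = pd.DataFrame({
    "Item": ["Apple", "Tomato", "Potato", "Orange"],
    "Count": [5, 10, 3, 20],
    "Tags": [["fruit", "red"], ["vegetable", "red"], [], ["fruit"]],
})

print(df)
```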

Best answer

Expand 'Count' by the length of each 'Tags' list

df.Count.repeat(df.Tags.str.len()).groupby(np.concatenate(df.Tags)).sum()

fruit        25
red          15
vegetable    10
Name: Count, dtype: int64
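
On pandas 0.25 or later, the same expansion can probably be written with `DataFrame.explode`, which is not part of the answer above. The sketch below rebuilds the sample data (assuming Tags holds Python lists) and relies on `groupby` dropping NaN keys by default, so the untagged Potato row does not contribute.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Apple", "Tomato", "Potato", "Orange"],
    "Count": [5, 10, 3, 20],
    "Tags": [["fruit", "red"], ["vegetable", "red"], [], ["fruit"]],
})

# explode() gives each tag its own row, copying Count alongside it.
# A row with an empty list becomes a single row whose tag is NaN.
exploded = df.explode("Tags")

# groupby() drops NaN keys by default, so untagged rows are ignored.
result = exploded.groupby("Tags")["Count"].sum()
print(result)
```

This keeps everything inside pandas and avoids building the flattened tag array by hand.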

numpy.bincount and pandas.factorize

i, r = pd.factorize(np.concatenate(df.Tags))
c = np.bincount(i, df.Count.repeat(df.Tags.str.len()))

pd.Series(c.astype(df.Count.dtype), r)

fruit        25
red          15
vegetable    10
dtype: int64
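
To see what the two calls do, the intermediate arrays can be inspected. This is an illustrative sketch on the same sample data, not the answer's exact code: the names `flat_tags`, `weights`, `codes`, `uniques`, and `totals` are made up for the example, and the tags are flattened with a list comprehension instead of `np.concatenate`.

```python
import numpy as np
import pandas as pd

counts = pd.Series([5, 10, 3, 20], name="Count")
tags = [["fruit", "red"], ["vegetable", "red"], [], ["fruit"]]

# Flatten the tag lists into a single 1-D array.
flat_tags = np.array([t for row in tags for t in row], dtype=object)

# Repeat each count once per tag on its row (zero times for Potato).
weights = np.repeat(counts.to_numpy(), [len(row) for row in tags])

# factorize() maps each distinct tag to an integer code.
codes, uniques = pd.factorize(flat_tags)

# bincount() sums the weights that share a code; the result is float64.
totals = np.bincount(codes, weights=weights)

print(codes)    # [0 1 2 1 0]
print(uniques)  # ['fruit' 'red' 'vegetable']
print(weights)  # [ 5  5 10 10 20]
print(totals)   # [25. 15. 10.]
```

Because `np.bincount` returns floats when weights are given, the answer casts the result back to the dtype of `Count`.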

Generic solution

from collections import defaultdict
import pandas as pd

counts = [5, 10, 3, 20]
tags = [['fruit', 'red'], ['vegetable', 'red'], [], ['fruit']]
d = defaultdict(int)

for c, T in zip(counts, tags):
  for t in T:
    d[t] += c

print(pd.Series(d))
print()
print(pd.DataFrame([*d.items()], columns=['Tag', 'Count']))

fruit        25
red          15
vegetable    10
dtype: int64

         Tag  Count
0      fruit     25
1        red     15
2  vegetable     10

Source: the Stack Overflow question "Grouping rows of a Pandas Series or DataFrame when a row can belong to multiple groups": https://stackoverflow.com/questions/52101276/