Problem description
I have the following dataframe:
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])
I want to group it by id and group and calculate the number of each term for this id, group pair.
So in the end I am going to get something like a table with one row per id, group pair and one column of counts per term.
I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand and there are ~10 of them.)
It looks like I have to group by and then count values, so I tried that with df.groupby(['id', 'group']).value_counts(), which does not work because value_counts operates on the groupby series and not a dataframe.
Is there any way I can achieve this without looping?
Recommended answer
I use groupby and size:
df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
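To make the one-liner concrete, here is a sketch applied to the sample data from the question. The reindex step is an optional addition and not part of the answer: it assumes the full list of terms is known beforehand, as the question states, and guarantees one column per term even if a term never occurs in the data. The names all_terms and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# Assumption: the complete list of terms is known in advance.
all_terms = ['term1', 'term2', 'term3']

counts = (
    df.groupby(['id', 'group', 'term'])
      .size()                    # number of rows per (id, group, term)
      .unstack(fill_value=0)     # pivot the term level into columns
      .reindex(columns=all_terms, fill_value=0)  # optional: fix the column set
)
print(counts)
# Counts per (id, group) pair:
#   (1, 1): term1=2, term2=1, term3=0
#   (1, 2): term1=0, term2=1, term3=0
#   (2, 2): term1=1, term2=0, term3=1
#   (2, 3): term1=1, term2=0, term3=0
```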
1,000,000 rows
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))
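A rough way to time the approach on data of this size is sketched below. It assumes the standard-library timeit module, a fixed random seed, and a helper named count_terms, none of which appear in the original. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
np.random.seed(0)  # arbitrary seed, only for repeatability
df = pd.DataFrame(dict(id=np.random.choice(100, n),
                       group=np.random.choice(20, n),
                       term=np.random.choice(10, n)))

def count_terms(frame):
    """Count rows per (id, group) pair, with one column per term."""
    return frame.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

counts = count_terms(df)

# Average wall-clock time over a few repetitions.
repeats = 3
elapsed = timeit.timeit(lambda: count_terms(df), number=repeats) / repeats
print(f"{elapsed:.3f} s per call on {n:,} rows")
```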