I have the following DataFrame:
item response
1 A
1 A
1 B
2 C
2 C
I want to add a column containing the most frequently given response for each item. This should result in:
item response mostGivenResponse
1 A A
1 A A
1 B A
2 C C
2 C C
I tried something like this:
df["responseCount"] = df.groupby(["ItemCode", "Response"])["Response"].transform("count")
df["mostGivenResponse"] = df.groupby(['ItemCode'])['responseCount'].transform(max)
But mostGivenResponse is now the count of the response rather than the response itself.
Best answer
Use value_counts and return the first index value:
df["responseCount"] = (df.groupby("item")["response"]
.transform(lambda x: x.value_counts().index[0]))
print (df)
item response responseCount
0 1 A A
1 1 A A
2 1 B A
3 2 C C
4 2 C C
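For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same value_counts approach. The sample data is an assumption reconstructed from the expected output in the question, and the result column is named mostGivenResponse as the question asked, rather than responseCount.

```python
import pandas as pd

# Sample data (assumed), matching the expected output shown in the question
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2],
    "response": ["A", "A", "B", "C", "C"],
})

# value_counts() sorts by frequency in descending order by default,
# so index[0] is the most frequent response within each item group.
# transform() broadcasts that single value back to every row of the group.
df["mostGivenResponse"] = (
    df.groupby("item")["response"]
      .transform(lambda x: x.value_counts().index[0])
)
print(df)
```

Note that when two responses are tied within a group, which one comes first in value_counts may depend on the pandas version, so do not rely on a particular tie-break.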
Or use collections.Counter.most_common:
from collections import Counter
df["responseCount"] = (df.groupby("item")["response"]
.transform(lambda x: Counter(x).most_common(1)[0][0]))
print (df)
item response responseCount
0 1 A A
1 1 A A
2 1 B A
3 2 C C
4 2 C C
Edit:
The problem is with groups that contain only NaNs; the solution is to filter with if-else:
print (df)
item response
0 1 A
1 1 A
2 2 NaN
3 2 NaN
4 3 NaN
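import numpy as np

def f(x):
    # value_counts() drops NaN by default, so an all-NaN group gives an empty Series
    s = x.value_counts()
    print (s)
    # printed output, one Series per group:
    # A 2
    # Name: 1, dtype: int64
    # Series([], Name: 2, dtype: int64)
    # Series([], Name: 3, dtype: int64)

    #return np.nan if s.empty else s.index[0]
    return np.nan if len(s) == 0 else s.index[0]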
df["responseCount"] = df.groupby("item")["response"].transform(f)
print (df)
item response responseCount
0 1 A A
1 1 A A
2 2 NaN NaN
3 2 NaN NaN
4 3 NaN NaN
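As a possible alternative to the if-else on value_counts, the same NaN handling can be sketched with Series.mode, which also skips NaN by default. This is a swapped-in technique, not the answer's method; the helper name most_common_or_nan and the column name mostGivenResponse are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the edit above: items 2 and 3 contain only NaN
df = pd.DataFrame({
    "item": [1, 1, 2, 2, 3],
    "response": ["A", "A", np.nan, np.nan, np.nan],
})

def most_common_or_nan(x):
    # Hypothetical helper: mode() ignores NaN by default (dropna=True),
    # so a group made up only of NaN yields an empty Series.
    m = x.mode()
    return np.nan if m.empty else m.iloc[0]

df["mostGivenResponse"] = (
    df.groupby("item")["response"].transform(most_common_or_nan)
)
print(df)
```

One design difference worth knowing: mode() returns every tied value, so taking iloc[0] picks just one of them when a group has a tie.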
Regarding python - Pandas: get the most frequently occurring string value in a group, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51288635/