问题描述
我有一个名称为(person_name),日期和颜色(shirt_color)为列的数据集.
I have a dataset with name (person_name), day and color (shirt_color) as columns.
每个人在特定的一天都穿着某种颜色的衬衫.天数可以是任意的.
Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.
例如输入:
name day color
----------------
John 1 White
John 2 White
John 3 Blue
John 4 Blue
John 5 White
Tom 2 White
Tom 3 Blue
Tom 4 Blue
Tom 5 Black
Jerry 1 Black
Jerry 2 Black
Jerry 4 Black
Jerry 5 White
我需要找到每个人最常用的颜色.
I need to find the most frequently used color by each person.
例如结果:
name color
-------------
Jerry Black
John White
Tom Blue
我正在执行以下操作来获取结果,该方法可以正常运行,但速度很慢:
I am performing the following operation to get the results, which works fine but is quite slow:
most_frquent_list = [[name, group.color.mode()[0]]
for name, group in data.groupby('name')]
most_frquent_df = pd.DataFrame(most_frquent_list, columns=['name', 'color'])
现在假设我有一个包含500万个唯一名称的数据集.进行上述操作的最佳/最快方法是什么?
Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?
推荐答案
Numpy的numpy.add.at
和pandas.factorize
这是为了快速.但是,我也尝试将其组织为可读性.
Numpy's numpy.add.at
and pandas.factorize
This is intended to be fast. However, I tried to organize it to be readable as well.
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)
np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)
John White
Tom Blue
Jerry Black
dtype: object
groupby
,size
和idxmax
groupby
, size
, and idxmax
df.groupby(['name', 'color']).size().unstack().idxmax(1)
name
Jerry Black
John White
Tom Blue
dtype: object
name
Jerry Black
John White
Tom Blue
Name: color, dtype: object
Counter
¯\_(ツ)_/¯
Counter
¯\_(ツ)_/¯
from collections import Counter
df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])
name
Jerry Black
John White
Tom Blue
Name: color, dtype: object
这篇关于更快地执行pandas groupby操作的替代方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!