本文介绍了通过 pandas DataFrame分组并选择最常见的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含三个字符串列的数据框.我知道第三列中的唯一一个值对于前两个的每种组合都有效.要清理数据,我必须按数据帧的前两列进行分组,并为每种组合选择第三列的最常用值.

I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group by data frame by first two columns and select most common value of the third column for each combination.

我的代码:

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

最后一行代码不起作用,它显示键错误'Short name'",如果我尝试仅按城市分组,则会收到AssertionError.我该怎么办?

Last line of code doesn't work, it says "Key error 'Short name'" and if I try to group only by City, then I got an AssertionError. What can I do fix it?

推荐答案

您可以使用value_counts()获取计数序列,并获取第一行:

You can use value_counts() to get a count series, and get the first row:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

如果您想在.agg()中执行其他agg函数,试试这个.

In case you are wondering about performing other agg functions in the .agg()try this.

# Let's add a new col,  account
source['account'] = [1,2,3,3]

source.groupby(['Country','City']).agg(mod  = ('Short name', \
                                        lambda x: x.value_counts().index[0]),
                                        avg = ('account', 'mean') \
                                      )

这篇关于通过 pandas DataFrame分组并选择最常见的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-05 23:05