Problem description
My dataframe consists of students, dates, and test scores. I want to find the max date for each student and return the corresponding row (ultimately, I am most interested in each student's most recent score). How can I do this in pandas?
Let's say my dataframe looks like this (an abbreviated version):
Student_id Date Score
Tina1 1/17/17 .95
John2 1/18/17 .8
Lia1 12/13/16 .845
John2 1/25/17 .975
Tina1 1/1/17 .78
Lia1 6/12/16 .89
This is what I want:
Student_id Date Score
Tina1 1/17/17 .95
Lia1 12/13/16 .845
John2 1/25/17 .975
I found this on SO, but it gives me a "positional indexers are out-of-bounds" error:
df.iloc[df.groupby('student_id').apply(lambda x: x['date'].idxmax())]
What other approaches would achieve the same thing?
Recommended answer
You can sort the data frame by Date and then use groupby.tail to get the most recent record for each student:
df.iloc[pd.to_datetime(df.Date, format='%m/%d/%y').argsort()].groupby('Student_id').tail(1)
#Student_id Date Score
#2 Lia1 12/13/16 0.845
#0 Tina1 1/17/17 0.950
#3 John2 1/25/17 0.975
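For reference, the same sort-then-tail idea can be written out as a self-contained sketch. It rebuilds the sample data from the question and uses sort_values instead of iloc with argsort, which is a stylistic substitution and not the answer's exact code. The names parsed and latest are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Student_id": ["Tina1", "John2", "Lia1", "John2", "Tina1", "Lia1"],
    "Date": ["1/17/17", "1/18/17", "12/13/16", "1/25/17", "1/1/17", "6/12/16"],
    "Score": [0.95, 0.8, 0.845, 0.975, 0.78, 0.89],
})

# Parse the date strings so they sort chronologically, not alphabetically
parsed = df.assign(Date=pd.to_datetime(df["Date"], format="%m/%d/%y"))

# After sorting by date, the last row of each group is the most recent one
latest = parsed.sort_values("Date").groupby("Student_id").tail(1)
print(latest)
```

Note that in this version the Date column of the result holds parsed datetimes, whereas the one-liner above leaves the original strings in place.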
Or, to avoid sorting, use idxmax (this works as long as the index has no duplicate labels):
df.loc[pd.to_datetime(df.Date, format='%m/%d/%y').groupby(df.Student_id).idxmax()]
# Student_id Date Score
#3 John2 1/25/17 0.975
#2 Lia1 12/13/16 0.845
#0 Tina1 1/17/17 0.950
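A point worth spelling out: idxmax returns index labels, not integer positions, so .loc is the matching indexer. Passing labels to .iloc is one plausible cause of the out-of-bounds error in the question whenever the index is not the default 0..n-1. The sketch below assumes a hypothetical non-default index (10 to 15) to make the difference visible. The names dates, latest_labels, and latest are illustrative only.

```python
import pandas as pd

# Sample data from the question, with a hypothetical non-default index
df = pd.DataFrame(
    {
        "Student_id": ["Tina1", "John2", "Lia1", "John2", "Tina1", "Lia1"],
        "Date": ["1/17/17", "1/18/17", "12/13/16", "1/25/17", "1/1/17", "6/12/16"],
        "Score": [0.95, 0.8, 0.845, 0.975, 0.78, 0.89],
    },
    index=[10, 11, 12, 13, 14, 15],  # assumed labels, for illustration only
)

# Parse the date strings so the maximum is chronological
dates = pd.to_datetime(df["Date"], format="%m/%d/%y")

# For each student, the index label of the row with the latest date
latest_labels = dates.groupby(df["Student_id"]).idxmax()

# Labels must be looked up with .loc; .iloc would treat them as positions
latest = df.loc[latest_labels]
print(latest)
```

The rows come back ordered by student id because groupby sorts the group keys by default.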