我有一个具有以下结构的数据框
Debtor ID | AccountRating | AccountRatingDate | AmountOutstanding |AmountPastDue
John Snow Closed 2017-03-01 0 0
John Snow Delayed 2017-04-22 2000 500
John Snow Closed 2017-05-23 0 0
John Snow Delayed 2017-07-15 6000 300
Sarah Parker Closed 2017-02-01 0 0
Edward Hall Closed 2017-05-01 0 0
Douglas Core Delayed 2017-01-01 1000 200
Douglas Core Delayed 2017-06-01 1000 400
我想要实现的是
Debtor ID | Incidents of delay | TheMostRecentOutstanding | TheMostRecentPastDue
John Snow 2 6000 300
Sarah Parker 0 0 0
Edward Hall 0 0 0
Douglas Core 2 1000 400
计算延迟事件非常简单
df_account["pastDuebool"] = df_account['amtPastDue'] > 0
new_df = pd.DataFrame(index = df_account.groupby("Debtor ID").groups.keys())
new_df['Incidents of delay'] = df_account.groupby("Debtor ID")["pastDuebool"].sum()
我正在努力提取最新的未完成和过期的 amonts。我的代码是这样的
new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountOutstanding"]
new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountPastDue"]
但是它们返回带有所有 NaN 值的系列。请帮助我,我在这里做错了什么?
最佳答案
你可以试试这个:
df.sort_values('AccountRatingDate')\
.query('AccountRating == "Delayed"')\
.groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
.agg({'AccountRating':'count',
'AmountOutstanding':'last',
'AmountPastDue':'last'})\
.reindex(df['Debtor ID'].unique(), fill_value=0)\
.reset_index()
输出:
Debtor ID AccountRating AmountOutstanding AmountPastDue
0 John Snow 2 6000 300
1 Sarah Parker 0 0 0
2 Edward Hall 0 0 0
3 Douglas Core 2 1000 400
细节:
最后记录。
'延迟'
指示如何聚合每一列的字典
没有任何延误
而且,您可以使用
rename
和字典进行列重命名:df.sort_values('AccountRatingDate')\
.query('AccountRating == "Delayed"')\
.groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
.agg({'AccountRating':'count',
'AmountOutstanding':'last',
'AmountPastDue':'last'})\
.reindex(df['Debtor ID'].unique(), fill_value=0)\
.rename(columns={'AccoutRating':'Incidents of delay',
'AmountOutstanding':'TheMostRecentOutstanding',
'AmountPastDue':'TheMostRecentPastDue'})\
.reset_index()
输出:
Debtor ID AccountRating TheMostRecentOutstanding TheMostRecentPastDue
0 John Snow 2 6000 300
1 Sarah Parker 0 0 0
2 Edward Hall 0 0 0
3 Douglas Core 2 1000 400
关于python - 从 Pandas 中的组中获取最新值,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/54908602/