I have a dataframe with the following structure

Debtor ID    | AccountRating    | AccountRatingDate   | AmountOutstanding    |AmountPastDue
John Snow      Closed             2017-03-01            0                     0
John Snow      Delayed            2017-04-22            2000                  500
John Snow      Closed             2017-05-23            0                     0
John Snow      Delayed            2017-07-15            6000                  300
Sarah Parker   Closed             2017-02-01            0                     0
Edward Hall    Closed             2017-05-01            0                     0
Douglas Core   Delayed            2017-01-01            1000                  200
Douglas Core   Delayed            2017-06-01            1000                  400

What I want to achieve is
Debtor ID    | Incidents of delay    | TheMostRecentOutstanding    | TheMostRecentPastDue
John Snow      2                       6000                          300
Sarah Parker   0                       0                             0
Edward Hall    0                       0                             0
Douglas Core   2                       1000                          400

Counting the incidents of delay is straightforward
df_account["pastDuebool"] = df_account['AmountPastDue'] > 0
new_df = pd.DataFrame(index = df_account.groupby("Debtor ID").groups.keys())
new_df['Incidents of delay'] = df_account.groupby("Debtor ID")["pastDuebool"].sum()

I am struggling to extract the most recent outstanding and past-due amounts. My code looks like this
new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountOutstanding"]
new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountPastDue"]

But both return a Series containing only NaN values. Please help me: what am I doing wrong here?

Best answer

You can try this:

df.sort_values('AccountRatingDate')\
  .query('AccountRating == "Delayed"')\
  .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
  .agg({'AccountRating':'count',
        'AmountOutstanding':'last',
        'AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(), fill_value=0)\
  .reset_index()

Output:
      Debtor ID  AccountRating  AmountOutstanding  AmountPastDue
0     John Snow              2               6000            300
1  Sarah Parker              0                  0              0
2   Edward Hall              0                  0              0
3  Douglas Core              2               1000            400
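
For anyone who wants to run the snippet above, here is a minimal self-contained sketch. It rebuilds the sample table from the question as `df` (values copied from the question; the variable name `result` is only illustrative) and applies the same chain, written with parentheses instead of backslash continuations.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Debtor ID': ['John Snow', 'John Snow', 'John Snow', 'John Snow',
                  'Sarah Parker', 'Edward Hall', 'Douglas Core', 'Douglas Core'],
    'AccountRating': ['Closed', 'Delayed', 'Closed', 'Delayed',
                      'Closed', 'Closed', 'Delayed', 'Delayed'],
    'AccountRatingDate': pd.to_datetime([
        '2017-03-01', '2017-04-22', '2017-05-23', '2017-07-15',
        '2017-02-01', '2017-05-01', '2017-01-01', '2017-06-01']),
    'AmountOutstanding': [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    'AmountPastDue': [0, 500, 0, 300, 0, 0, 200, 400],
})

result = (
    df.sort_values('AccountRatingDate')            # oldest first, newest last
      .query('AccountRating == "Delayed"')         # keep only delayed records
      .groupby('Debtor ID')[['AccountRating', 'AmountOutstanding', 'AmountPastDue']]
      .agg({'AccountRating': 'count',              # number of delayed records
            'AmountOutstanding': 'last',           # value on the latest date
            'AmountPastDue': 'last'})
      .reindex(df['Debtor ID'].unique(), fill_value=0)  # add debtors with no delays
      .reset_index()
)
print(result)
```
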

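Details:
  • First, sort the dataframe by AccountRatingDate so that the last row in
    each group is the most recent record.
  • Filter the dataframe to only those rows where AccountRating equals
    'Delayed'.
  • Group by Debtor ID, select the columns to aggregate, then use agg with
    a dictionary indicating how to aggregate each column.
  • Reindex with the unique values of Debtor ID so that debtors without
    any delays are filled in with zeros.
  • Finally, reset the index.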

  • Also, you can rename the columns using rename with a dictionary:
    df.sort_values('AccountRatingDate')\
      .query('AccountRating == "Delayed"')\
      .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
      .agg({'AccountRating':'count',
            'AmountOutstanding':'last',
            'AmountPastDue':'last'})\
      .reindex(df['Debtor ID'].unique(), fill_value=0)\
      .rename(columns={'AccountRating':'Incidents of delay',
                       'AmountOutstanding':'TheMostRecentOutstanding',
                       'AmountPastDue':'TheMostRecentPastDue'})\
      .reset_index()


    Output:
          Debtor ID  Incidents of delay  TheMostRecentOutstanding  TheMostRecentPastDue
    0     John Snow                   2                      6000                   300
    1  Sarah Parker                   0                         0                     0
    2   Edward Hall                   0                         0                     0
    3  Douglas Core                   2                      1000                   400
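
As for why the original attempt produced only NaN: judging from the code in the question, the likely cause is index alignment. `new_df` is indexed by debtor name, while the Series selected with `df_account.loc[..., "AmountOutstanding"]` keeps the integer row labels of `df_account`, so no labels match when it is assigned as a column. One possible way around that, sketched below on the sample data from the question, is to re-index the selected rows by `Debtor ID` before assigning. The names `delayed`, `latest_rows` and `latest` are illustrative and not from the original post.

```python
import pandas as pd

# Sample data copied from the question.
df_account = pd.DataFrame({
    'Debtor ID': ['John Snow', 'John Snow', 'John Snow', 'John Snow',
                  'Sarah Parker', 'Edward Hall', 'Douglas Core', 'Douglas Core'],
    'AccountRating': ['Closed', 'Delayed', 'Closed', 'Delayed',
                      'Closed', 'Closed', 'Delayed', 'Delayed'],
    'AccountRatingDate': pd.to_datetime([
        '2017-03-01', '2017-04-22', '2017-05-23', '2017-07-15',
        '2017-02-01', '2017-05-01', '2017-01-01', '2017-06-01']),
    'AmountOutstanding': [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    'AmountPastDue': [0, 500, 0, 300, 0, 0, 200, 400],
})

# One row per debtor, indexed by debtor name.
new_df = pd.DataFrame(index=df_account['Debtor ID'].unique())
new_df['Incidents of delay'] = (
    (df_account['AmountPastDue'] > 0).groupby(df_account['Debtor ID']).sum()
)

# Row label of the latest 'Delayed' record for each debtor.
delayed = df_account[df_account['AccountRating'] == 'Delayed']
latest_rows = delayed.groupby('Debtor ID')['AccountRatingDate'].idxmax()

# Index the selected rows by debtor name so they align with new_df.
latest = df_account.loc[latest_rows].set_index('Debtor ID')

new_df['TheMostRecentOutstanding'] = latest['AmountOutstanding']
new_df['TheMostRecentPastDue'] = latest['AmountPastDue']

# Debtors with no delayed record come out as NaN; treat them as zero.
new_df = new_df.fillna(0).astype(int)
print(new_df)
```

This keeps the `idxmax` approach from the question; the only substantive change is the `set_index('Debtor ID')` step, which gives both sides of the assignment the same labels.
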
    

Regarding python - getting the most recent value from groups in Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54908602/
