Complex groupby of a dataframe with operations and creation of new columns

Problem description

I have a question and could not find a good answer that I can apply. It seems to be more complex than I thought:

This is my current dataframe, df:

[customerid, visit_number, date,        purchase_amount]
[1,          38,           01-01-2019,  40             ]
[1,          39,           01-03-2019,  20             ]
[2,          10,           01-02-2019,  60             ]
[2,          14,           01-05-2019,  0              ]
[3,          10,           01-01-2019,  5              ]

I want to aggregate this table so that I end up with one row per customer, along with additional columns derived from the original ones, like this:

df_new=

[customerid, visits,      days,              purchase_amount]
[1,          2,           3,                 60             ]
[2,          5,           4,                 60             ]
[3,          1,           1,                 5              ]

Note that if a customer has no other date or visit to compare against, those metrics are always 1 (see customerid=3).

As I said, I have looked around for days but could not find much help. I hope someone can guide me. Thank you very much.

Recommended answer

You can use groupby.agg:

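import pandas as pd

# 'diff' is a transformation, not a reduction; whether agg accepts it
# alongside 'sum' depends on the pandas version.
df['date'] = pd.to_datetime(df['date'])
g = df.groupby('customerid')
df.index = df['customerid']
df_new = g.agg({'purchase_amount': 'sum', 'visit_number': 'diff', 'date': 'diff'})
# Keep one row per customer: rows with a real diff sort ahead of the NaT rows.
df_new = df_new.reset_index().sort_values('date').drop_duplicates('customerid').reset_index(drop=True)
# Add 1 so that both spans are inclusive.
df_new['visit_number'] = df_new['visit_number'] + 1
df_new['date'] = df_new['date'] + pd.Timedelta('1 days')
df_new = (df_new.rename(columns={'visit_number': 'visits', 'date': 'days'})
                .reindex(columns=['customerid', 'visits', 'days', 'purchase_amount']))
# Customers with a single row have no diff; default both metrics to 1.
df_new['visits'] = df_new['visits'].fillna(1)
df_new['days'] = df_new['days'].fillna(pd.Timedelta('1 days'))
print(df_new)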


Output:

   customerid  visits   days  purchase_amount
0           1     2.0 3 days               60
1           2     5.0 4 days               60
2           3     1.0 1 days                5

Alternative solution:

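import pandas as pd

df['date'] = pd.to_datetime(df['date'])
g = df.groupby('customerid')
df.index = df['customerid']
# 'diff' is a transformation, not a reduction; whether agg accepts it
# depends on the pandas version.
df2 = g.agg({'visit_number': 'diff', 'date': 'diff'})
# Drop the NaN rows, i.e. the first row of each customer.
df2 = df2.loc[df2['visit_number'].notnull()]
# Add 1 so that both spans are inclusive.
df2['visit_number'] = df2['visit_number'] + 1
df2['date'] = df2['date'] + pd.Timedelta('1 days')
# Totals per customer, combined with the diffs on the customerid index.
df3 = g.agg({'purchase_amount': 'sum'})
df_new = (pd.concat([df2, df3], sort=False, axis=1)
            .rename(columns={'visit_number': 'visits', 'date': 'days'})
            .reset_index())
# Customers with a single row have no diff; default both metrics to 1.
df_new['visits'] = df_new['visits'].fillna(1)
df_new['days'] = df_new['days'].fillna(pd.Timedelta('1 days'))
print(df_new)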
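
Both versions above pass 'diff' to agg and then pick a single diff row per customer. As a possible variation (not part of the original answer), the same table can be sketched with named aggregation, which needs pandas 0.25 or later. This assumes that "visits" and "days" mean the inclusive span between a customer's first and last row, and that the dates are month-day-year, which is what the expected output in the question suggests. The intermediate names (stats, first_visit, last_visit, first_date, last_date) are made up for illustration.

```python
import pandas as pd

# Sample data from the question; dates are assumed to be month-day-year.
df = pd.DataFrame({
    "customerid": [1, 1, 2, 2, 3],
    "visit_number": [38, 39, 10, 14, 10],
    "date": ["01-01-2019", "01-03-2019", "01-02-2019", "01-05-2019", "01-01-2019"],
    "purchase_amount": [40, 20, 60, 0, 5],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")

# One row per customer: collect the extremes and the total in a single pass.
stats = df.groupby("customerid").agg(
    first_visit=("visit_number", "min"),
    last_visit=("visit_number", "max"),
    first_date=("date", "min"),
    last_date=("date", "max"),
    purchase_amount=("purchase_amount", "sum"),
)

# Inclusive spans: a customer with a single row gets 1 for both metrics,
# so no fillna step is needed.
df_new = pd.DataFrame({
    "visits": stats["last_visit"] - stats["first_visit"] + 1,
    "days": (stats["last_date"] - stats["first_date"]).dt.days + 1,
    "purchase_amount": stats["purchase_amount"],
}).reset_index()

print(df_new)
```

For the sample data this gives the same values as the df_new requested in the question. Unlike the answer above, visits and days come out as plain integers rather than floats and Timedeltas, and customers with more than two rows are handled by taking the overall first-to-last span.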
