Problem Description
Consider the following reduced example data frame df:
    Department  CustomerID        Date  Price  MenswearDemand  HomeDemand
 0    Menswear      418089  2019-04-18    199             199           0
 1    Menswear      613573  2019-04-24    199             199           0
 2    Menswear      161840  2019-04-25    199             199           0
 3    Menswear     2134926  2019-04-29    199             199           0
 4    Menswear      984801  2019-04-30     19              19           0
 5        Home      398555  2019-01-27     52               0          52
 6    Menswear      682906  2019-02-03     97              97           0
 7    Menswear      682906  2019-02-03     97              97           0
 8    Menswear      923491  2019-02-09     80              80           0
 9    Menswear     1098782  2019-02-25    258             258           0
10    Menswear      721696  2019-03-25     12              12           0
11    Menswear      695706  2019-04-10    129             129           0
12   Underwear      637026  2019-01-18    349               0           0
13   Underwear      205997  2019-01-25    279               0           0
14   Underwear      787984  2019-02-01     27               0           0
15   Underwear      318256  2019-02-01    279               0           0
16   Underwear      570454  2019-02-14    262               0           0
17   Underwear     1239118  2019-02-28    279               0           0
18        Home     1680791  2019-04-04   1398               0        1398
I want to group this data based on 'CustomerID' and then:

- Turn the purchase date 'Date' into the number of days until a cutoff date, which is '2021-01-01'. This is just the time from the customer's most recent purchase until '2021-01-01' (a minimal sketch of this date arithmetic follows the list).
- Sum over all the remaining Demand columns, in this example only 'MenswearDemand' and 'HomeDemand'.
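For reference, a minimal sketch of that date arithmetic (note that the expected output below corresponds to a cutoff of '2019-05-01', one day after the latest purchase in the sample, rather than '2021-01-01'):

import pandas as pd

# Days from a customer's most recent purchase to a fixed cutoff date.
purchases = pd.to_datetime(pd.Series(['2019-04-18', '2019-04-25']))
cutoff = pd.Timestamp('2019-05-01')  # one day after the sample's latest purchase
print((cutoff - purchases.max()).days)  # 6, matching customer 161840 below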
The result I should get is:
            Date  MenswearDemand  HomeDemand
CustomerID
    161840     6             199           0
    205997    96               0           0
    318256    89               0           0
    398555    94               0          52
    418089    13             199           0
    570454    76               0           0
    613573     7             199           0
    637026   103               0           0
    682906    87             194           0
    695706    21             129           0
    721696    37              12           0
    787984    89               0           0
    923491    81              80           0
    984801     1              19           0
   1098782    65             258           0
   1239118    62               0           0
   1680791    27               0        1398
   2134926     2             199           0
This is how I managed to solve this:
import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# Cutoff is one day after the most recent purchase in the data.
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})
However, in reality I have about 15 million rows and 30 demand columns. I really don't want to write all those 'DemandColumn': lambda x: x.sum() entries in my aggregate function every time, since they should all be summed. Is there a better way of doing this? Like passing in an array of the subset of columns that one wants to apply a particular operation to?
Recommended Answer
If you know your column names in advance, you can build the dictionary before passing it to the agg function.
...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
# Start with the one column that needs a custom aggregation ...
agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}
# ... then add an entry per demand column, all sharing the same summing function.
DemandColumns = ['MenswearDemand', 'HomeDemand']
f = lambda x: x.sum()
agg_dict.update({col_name: f for col_name in DemandColumns})
newdf = df.groupby('CustomerID').agg(agg_dict)
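With 30 demand columns, the list itself can also be built programmatically. A minimal sketch, assuming (as in the sample data) that every demand column's name ends in 'Demand':

demand_cols = [c for c in df.columns if c.endswith('Demand')]  # assumes the naming pattern holds
agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}
agg_dict.update({c: 'sum' for c in demand_cols})
newdf = df.groupby('CustomerID').agg(agg_dict)

Passing the string 'sum' instead of a lambda also lets pandas use its optimized built-in aggregation, which is worth having at 15 million rows.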
Another option (knowing the column names, DemandColumns in the previous example) is to first use the agg function to calculate the Date column, and then use the filter function, passing the list of desired columns as the items argument, to keep only those exact columns.
...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
groups = df.groupby('CustomerID')
# Aggregate only the 'Date' column here; applying the date lambda to every
# column would fail on non-numeric columns such as 'Department'.
newdf = groups['Date'].agg(lambda x: (cutoffDate - x.max()).days)
# Sum the desired demand columns per group and join the two results.
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(items=DemandColumns).agg('sum'))], axis=1)
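A variant worth knowing (a sketch under the same DemandColumns assumption): selecting the demand columns directly on the GroupBy object gives the same sums without a Python-level apply per group, which is noticeably faster at this scale.

demand_sums = groups[DemandColumns].sum()  # vectorized per-group sum, no apply
newdf = pd.concat([groups['Date'].agg(lambda x: (cutoffDate - x.max()).days), demand_sums], axis=1)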
If the desired columns (DemandColumns in the previous example) follow a given pattern, you can skip the list creation and use the filter function with the regex argument. In this case, you could use the regex '.*Demand$' to return all columns whose names end with the string Demand.
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').agg('sum'))], axis=1)
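Putting it together, a self-contained sketch of the regex variant, run on five rows taken from the question's sample:

import datetime as dt
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [418089, 613573, 984801, 682906, 682906],
    'Date': pd.to_datetime(['2019-04-18', '2019-04-24', '2019-04-30',
                            '2019-02-03', '2019-02-03']),
    'MenswearDemand': [199, 199, 19, 97, 97],
    'HomeDemand': [0, 0, 0, 0, 0],
})
cutoffDate = df['Date'].max() + dt.timedelta(days=1)  # 2019-05-01
groups = df.groupby('CustomerID')
newdf = groups['Date'].agg(lambda x: (cutoffDate - x.max()).days)
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').agg('sum'))], axis=1)
print(newdf)
# e.g. customer 682906 -> Date 87, MenswearDemand 194, as in the expected output.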