Consider the following reduced example data frame df:

   Department  CustomerID          Date  Price  MenswearDemand  HomeDemand
0    Menswear      418089    2019-04-18    199             199           0
1    Menswear      613573    2019-04-24    199             199           0
2    Menswear      161840    2019-04-25    199             199           0
3    Menswear     2134926    2019-04-29    199             199           0
4    Menswear      984801    2019-04-30     19              19           0
5        Home      398555    2019-01-27     52               0          52
6    Menswear      682906    2019-02-03     97              97           0
7    Menswear      682906    2019-02-03     97              97           0
8    Menswear      923491    2019-02-09     80              80           0
9    Menswear     1098782    2019-02-25    258             258           0
10   Menswear      721696    2019-03-25     12              12           0
11   Menswear      695706    2019-04-10    129             129           0
12  Underwear      637026    2019-01-18    349               0           0
13  Underwear      205997    2019-01-25    279               0           0
14  Underwear      787984    2019-02-01     27               0           0
15  Underwear      318256    2019-02-01    279               0           0
16  Underwear      570454    2019-02-14    262               0           0
17  Underwear     1239118    2019-02-28    279               0           0
18       Home     1680791    2019-04-04   1398               0        1398

I want to group this data based on 'CustomerID' and then:

  1. Turn the purchase date 'Date' into the number of days until a cutoff date, which is '2021-01-01'. This is just the time from the customer's most recent purchase until '2021-01-01'.
  2. Sum over all the remaining Demand columns, in this example only 'MenswearDemand' and 'HomeDemand'.

The result should look like this (the numbers shown here use a cutoff of 2019-05-01, one day after the latest purchase in the sample, as in my code below):


            Date  MenswearDemand  HomeDemand
161840         6             199           0
205997        96               0           0
318256        89               0           0
398555        94               0          52
418089        13             199           0
570454        76               0           0
613573         7             199           0
637026       103               0           0
682906        87             194           0
695706        21             129           0
721696        37              12           0
787984        89               0           0
923491        81              80           0
984801         1              19           0
1098782       65             258           0
1239118       62               0           0
1680791       27               0        1398
2134926        2             199           0


This is how I managed to solve this:

import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# Cutoff: one day after the latest purchase in the data
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})

However, in reality I have about 15 million rows and 30 demand columns. I really don't want to write out 'DemandColumn': lambda x: x.sum() for every one of them in my aggregate function, since they should all simply be summed. Is there a better way of doing this? For example, by passing in an array of the subset of columns that a particular operation should be applied to?


If you know your column names in advance you can build the dictionary before passing to the agg function.

cutoffDate = df['Date'].max() + dt.timedelta(days=1)

agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}

DemandColumns = ['MenswearDemand', 'HomeDemand']
f = lambda x: x.sum()
agg_dict.update({col_name: f for col_name in DemandColumns})

newdf = df.groupby('CustomerID').agg(agg_dict)
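
For reference, here is a minimal self-contained sketch of the same dictionary-building idea. The small frame is an assumption for illustration (four rows copied from the sample in the question), and using the string 'sum' in place of the lambda is my own substitution: it asks pandas for its built-in groupby sum, which is usually faster than calling a Python lambda per group.

```python
import datetime as dt
import pandas as pd

# Illustrative subset of the question's sample data (assumption: same column layout)
df = pd.DataFrame({
    'CustomerID': [682906, 682906, 398555, 984801],
    'Date': ['2019-02-03', '2019-02-03', '2019-01-27', '2019-04-30'],
    'MenswearDemand': [97, 97, 0, 19],
    'HomeDemand': [0, 0, 52, 0],
})
df['Date'] = pd.to_datetime(df['Date'])

# Cutoff: one day after the latest purchase, as in the question
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

DemandColumns = ['MenswearDemand', 'HomeDemand']

# Start with the special-case column, then add one 'sum' entry per demand column
agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}
agg_dict.update({col_name: 'sum' for col_name in DemandColumns})

newdf = df.groupby('CustomerID').agg(agg_dict)
print(newdf)
```

With 30 demand columns only the DemandColumns list grows; the agg call itself stays the same.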

Another option, if you know the column names (DemandColumns in the previous example), is to compute the Date column with agg first and then use the filter function, passing the list of desired columns as the items argument, to keep exactly those columns before summing them.

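cutoffDate = df['Date'].max() + dt.timedelta(days=1)
groups = df.groupby('CustomerID')
# Select 'Date' first; applying the lambda to every column fails on non-date columns
newdf = groups['Date'].agg(lambda x: (cutoffDate - x.max()).days)
# Sum only the demand columns per customer and join them to the Date result
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(items=DemandColumns).agg('sum'))], axis=1)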

If the desired columns (DemandColumns) follow a given pattern, you can skip creating the list and use the filter function with the regex argument instead. In this case, the regex '.*Demand$' matches all columns whose names end with the string Demand.

# Used in place of the concat line above: the columns are selected by pattern instead of by list
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').agg('sum'))], axis=1)
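
With around 15 million rows, calling apply once per group may be slow. As a rough sketch of an alternative (not benchmarked, and not part of the approach above), the regex can be applied once to the frame to get the column names, and the groupby can then use its built-in max and sum. The sample frame and the names days and demand are assumptions for illustration.

```python
import datetime as dt
import pandas as pd

# Illustrative subset of the question's sample data (assumption: same column layout)
df = pd.DataFrame({
    'CustomerID': [682906, 682906, 398555, 984801],
    'Date': ['2019-02-03', '2019-02-03', '2019-01-27', '2019-04-30'],
    'MenswearDemand': [97, 97, 0, 19],
    'HomeDemand': [0, 0, 52, 0],
})
df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

# Column names ending in "Demand", read from the frame itself
DemandColumns = df.filter(regex='Demand$').columns.tolist()

groups = df.groupby('CustomerID')

# Days from each customer's most recent purchase to the cutoff
days = (cutoffDate - groups['Date'].max()).dt.days

# Built-in groupby sum restricted to the demand columns
demand = groups[DemandColumns].sum()

# Both results are indexed by CustomerID, so they align on concat
newdf = pd.concat([days, demand], axis=1)
print(newdf)
```

The design choice here is to avoid Python-level lambdas entirely, so both aggregations run through pandas' own groupby methods.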

