Problem description
This question is a modification of a problem I posted here, where I have occurrences of a specific type on different days. This time, however, they are assigned to multiple users, for example:
df = data.frame(user_id = c(rep(1:2, each=5)),
cancelled_order = c(rep(c(0,1,1,0,0), 2)),
order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-03-23',
'2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21', '2015-03-26')))
user_id cancelled_order order_date
1 0 2015-01-28
1 1 2015-01-31
1 1 2015-02-08
1 0 2015-02-23
1 0 2015-03-23
2 0 2015-01-25
2 1 2015-01-28
2 1 2015-02-06
2 0 2015-02-21
2 0 2015-03-26
I would like to calculate:
1) the number of cancelled orders that each customer is going to have in the next x days (e.g. 7, 14), excluding the current one, and
2) the number of cancelled orders that each customer had in the past x days (e.g. 7, 14), excluding the current one.
The desired output looks like this:
solution
user_id cancelled_order order_date plus14 minus14
1 0 2015-01-28 2 0
1 1 2015-01-31 1 0
1 1 2015-02-08 0 1
1 0 2015-02-23 0 0
1 0 2015-03-23 0 0
2 0 2015-01-25 2 0
2 1 2015-01-28 1 0
2 1 2015-02-06 0 1
2 0 2015-02-21 0 0
2 0 2015-03-26 0 0
A solution that fits this purpose well was presented by @joel.wilson using data.table:
library(data.table)
vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) :=
lapply(vec, function(i) sum(df$cancelled_order[between(df$order_date,
order_date,
order_date + i, # this part can be changed to reflect the past date ranges
incbounds = FALSE)])),
by = order_date]
However, it does not take grouping by user_id into account. When I tried to modify the formula by adding this grouping as by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. It seems to be something very basic; any hints on how to get around this detail?
Also, keep in mind that I'm after a solution that works, even if it is not based on the above code or data.table at all!
Thanks!
Recommended answer
Here is one way:
library(data.table)
orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))
vec = list(minus = 14L, plus = 14L)
orderDT[, c("dplus", "dminus") := .(
orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
,
orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)], on=.(id, d >= d_minus, d <= d_yest), .N, by=.EACHI]$N
)]
id completed d dplus dminus
1: 1 TRUE 2015-01-28 2 0
2: 1 FALSE 2015-01-31 1 0
3: 1 FALSE 2015-02-08 0 1
4: 1 TRUE 2015-02-23 0 0
5: 1 TRUE 2015-03-23 0 0
6: 2 TRUE 2015-01-25 2 0
7: 2 FALSE 2015-01-28 1 0
8: 2 FALSE 2015-02-06 0 1
9: 2 TRUE 2015-02-21 0 0
10: 2 TRUE 2015-03-26 0 0
(I found OP's column names cumbersome and so shortened them.)
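Since the question also allows solutions outside data.table, the counts above can be cross-checked with a brute-force pass in base R. This is only a sketch for small data: `count_window()` is a hypothetical helper written for this illustration (it is not part of any package), and it compares every row against every other row, so it will not scale the way the join does.

```r
# Rebuild the question's example data (base R only).
df <- data.frame(
  user_id = rep(1:2, each = 5),
  cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23", "2015-03-23",
                         "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21", "2015-03-26"))
)

# count_window() is a hypothetical helper, not a library function.
# For each row i it counts the cancelled orders of the same user whose
# date is between lo and hi days away from row i's date (both inclusive).
count_window <- function(data, lo, hi) {
  vapply(seq_len(nrow(data)), function(i) {
    gap <- as.numeric(data$order_date - data$order_date[i])  # signed gap in days
    sum(data$user_id == data$user_id[i] &
          data$cancelled_order == 1 &
          gap >= lo & gap <= hi)
  }, integer(1))
}

df$plus14  <- count_window(df, 1, 14)    # tomorrow through 14 days ahead
df$minus14 <- count_window(df, -14, -1)  # 14 days back through yesterday
print(df)
```

Starting the window at 1 (or ending it at -1) is what excludes the current order, mirroring the d_tom and d_yest bounds in the join.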
How it works
Each of the columns can be run on its own, like
orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
And this can be broken down into steps by simplifying:
orderDT[!(completed)][
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
on=.(id, d <= d_plus, d >= d_tom),
.N,
by=.EACHI]$N
# original version
orderDT[!(completed)][
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
on=.(id, d <= d_plus, d >= d_tom),
.N,
by=.EACHI]
# don't extract the N column of counts
orderDT[!(completed)][
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
on=.(id, d <= d_plus, d >= d_tom)]
# don't create the N column of counts
orderDT[!(completed)]
# don't do the join
orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
# see the second table used in the join
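The same breakdown applies to the past-window column. The sketch below rebuilds the example data so that it runs on its own and splits the dminus join into named pieces; the names `cancelled`, `past_win` and `past_counts` are introduced here for illustration only, and the window length is hard-coded to 14 days instead of being read from vec.

```r
library(data.table)

# Example data from the question.
df <- data.frame(
  user_id = rep(1:2, each = 5),
  cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23", "2015-03-23",
                         "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21", "2015-03-26"))
)
orderDT <- with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))

# Left side of the join: cancelled orders only.
cancelled <- orderDT[!(completed)]

# Right side: one past window per order, from 14 days back through yesterday.
past_win <- orderDT[, .(id, d_minus = d - 14L, d_yest = d - 1L)]

# For each window, count the cancelled orders of the same id that fall inside it.
past_counts <- cancelled[past_win, on = .(id, d >= d_minus, d <= d_yest), .N, by = .EACHI]
print(past_counts)
```

Because by=.EACHI produces one result row per row of `past_win`, the N column lines up with the rows of orderDT and can be assigned back directly.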
This uses a "non-equi" join, taking inequalities to define the date ranges. For more details, see the documentation page found by typing ?data.table.
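To see the non-equi join in isolation, here is a minimal sketch on made-up toy tables. The tables `events` and `windows` and their values are invented for this illustration and have nothing to do with the order data; only the on= and by=.EACHI pattern is the same as above.

```r
library(data.table)

# Toy data: three events and three [lo, hi] windows, keyed by id.
events  <- data.table(id = c(1L, 1L, 2L), day = c(3L, 8L, 5L))
windows <- data.table(id = c(1L, 1L, 2L), lo = c(1L, 5L, 6L), hi = c(9L, 9L, 9L))

# For each row of `windows`, count the rows of `events` with the same id
# whose day satisfies lo <= day <= hi.
counts <- events[windows, on = .(id, day >= lo, day <= hi), .N, by = .EACHI]$N
print(counts)
```

The first window catches both events of id 1, the second only the event on day 8, and the third none, since the only event of id 2 falls before its window starts.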