Counting past and future occurrences of a specific event, with grouping

Problem description

This question is a modification of a problem I posted here, where I have occurrences of a specific type on different days, but this time they are assigned to multiple users, for example:

    df = data.frame(user_id = c(rep(1:2, each=5)),
                    cancelled_order = c(rep(c(0,1,1,0,0), 2)),
                    order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08',
                                           '2015-02-23', '2015-03-23', '2015-01-25',
                                           '2015-01-28', '2015-02-06', '2015-02-21',
                                           '2015-03-26')))

    user_id cancelled_order order_date
          1               0 2015-01-28
          1               1 2015-01-31
          1               1 2015-02-08
          1               0 2015-02-23
          1               0 2015-03-23
          2               0 2015-01-25
          2               1 2015-01-28
          2               1 2015-02-06
          2               0 2015-02-21
          2               0 2015-03-26

I'd like to calculate

1) the number of cancelled orders that each customer is going to have in the next x days (e.g. 7, 14), excluding the current one, and

2) the number of cancelled orders that each customer had in the past x days (e.g. 7, 14), excluding the current one.

The desired output would look like this:

    user_id cancelled_order order_date plus14 minus14
          1               0 2015-01-28      2       0
          1               1 2015-01-31      1       0
          1               1 2015-02-08      0       1
          1               0 2015-02-23      0       0
          1               0 2015-03-23      0       0
          2               0 2015-01-25      2       0
          2               1 2015-01-28      1       0
          2               1 2015-02-06      0       1
          2               0 2015-02-21      0       0
          2               0 2015-03-26      0       0

A solution that fits this purpose perfectly was presented by @joel.wilson using data.table:

    library(data.table)
    vec <- c(14, 30)  # Specify desired ranges
    setDT(df)[, paste0("x", vec) :=
                lapply(vec, function(i)
                  sum(df$cancelled_order[between(df$order_date,
                                                 order_date,
                                                 order_date + i,  # this part can be changed to reflect the past date ranges
                                                 incbounds = FALSE)])),
              by = order_date]

However, it does not take grouping by user_id into account. When I tried to modify the formula by adding this grouping as by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. This seems like something very basic; any hints on how to get around this detail?

Also, keep in mind that I'm after a solution that works, even if it is not based on the above code or data.table at all!

Thanks!

Solution

Here's one way:

    library(data.table)
    orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))

    vec = list(minus = 14L, plus = 14L)
    orderDT[, c("dplus", "dminus") := .(
      orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
        on = .(id, d <= d_plus, d >= d_tom), .N, by = .EACHI]$N
      ,
      orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)],
        on = .(id, d >= d_minus, d <= d_yest), .N, by = .EACHI]$N
    )]

        id completed          d dplus dminus
     1:  1      TRUE 2015-01-28     2      0
     2:  1     FALSE 2015-01-31     1      0
     3:  1     FALSE 2015-02-08     0      1
     4:  1      TRUE 2015-02-23     0      0
     5:  1      TRUE 2015-03-23     0      0
     6:  2      TRUE 2015-01-25     2      0
     7:  2     FALSE 2015-01-28     1      0
     8:  2     FALSE 2015-02-06     0      1
     9:  2      TRUE 2015-02-21     0      0
    10:  2      TRUE 2015-03-26     0      0

(I found OP's column names cumbersome and so shortened them.)

How it works

Each of the columns can be run on its own, like

    orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on = .(id, d <= d_plus, d >= d_tom), .N, by = .EACHI]$N

And this can be broken down into steps by simplifying:

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on = .(id, d <= d_plus, d >= d_tom),
      .N,
      by = .EACHI]$N
    # original version

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on = .(id, d <= d_plus, d >= d_tom),
      .N,
      by = .EACHI]
    # don't extract the N column of counts

    orderDT[!(completed)][
      orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)],
      on = .(id, d <= d_plus, d >= d_tom)]
    # don't create the N column of counts

    orderDT[!(completed)]
    # don't do the join

    orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
    # see the second table used in the join

This uses a "non-equi" join, taking inequalities to define the date ranges. For more details, see the documentation page found by typing ?data.table.
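As a cross-check on the join-based counts, the same numbers can be reproduced with a brute-force loop in base R. This is only a sketch and not part of the original answer: the helper name count_cancelled and its from/to arguments are made up for illustration, and it rescans the whole table once per order, so it is better suited to verifying small data sets than to replacing the non-equi join. Like the answer, it counts only strictly earlier or later days, so other orders placed on the same day are excluded along with the current one.

```r
# Sample data from the question
df <- data.frame(
  user_id = rep(1:2, each = 5),
  cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08",
                         "2015-02-23", "2015-03-23", "2015-01-25",
                         "2015-01-28", "2015-02-06", "2015-02-21",
                         "2015-03-26"))
)

# Hypothetical helper (not from the original answer): for every row, count the
# same user's cancelled orders whose day offset from the current order lies in
# [from, to]. Offsets are relative to the current row, so from = 1 starts at
# tomorrow and to = -1 ends at yesterday.
count_cancelled <- function(data, from, to) {
  vapply(seq_len(nrow(data)), function(i) {
    offset <- as.integer(data$order_date - data$order_date[i])
    sum(data$cancelled_order == 1 &
          data$user_id == data$user_id[i] &
          offset >= from & offset <= to)
  }, integer(1))
}

df$plus14  <- count_cancelled(df, from = 1,   to = 14)  # next 14 days
df$minus14 <- count_cancelled(df, from = -14, to = -1)  # previous 14 days
print(df)
```

On the sample data, plus14 and minus14 should agree with the dplus and dminus columns shown in the answer. Other windows only need different offsets, for example count_cancelled(df, from = 1, to = 7) for the next 7 days.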