I have a dataset named visitsPerDay that looks like this, but with 405,890 rows and 10,406 unique customer IDs:
> CUST_ID Date
> 1 2013-09-19
> 1 2013-10-03
> 1 2013-10-08
> 1 2013-10-12
> 1 2013-10-20
> 1 2013-10-25
> 1 2013-11-01
> 1 2013-11-02
> 1 2013-11-08
> 1 2013-11-15
> 1 2013-11-23
> 1 2013-12-02
> 1 2013-12-04
> 1 2013-12-09
> 2 2013-09-16
> 2 2013-09-17
> 2 2013-09-18
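What I want to do is create a new variable holding the lagged difference between each customer's consecutive visit dates. Here is the code I'm currently using: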
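visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID), ]
# create the result column once, then fill it row by row
visitsPerDay$MBTV <- NA_real_
cust_id <- 0
for (i in seq_len(nrow(visitsPerDay))) {
  if (visitsPerDay$CUST_ID[i] != cust_id) {
    # first visit of a new customer: there is no previous visit to compare with
    cust_id <- visitsPerDay$CUST_ID[i]
    visitsPerDay$MBTV[i] <- NA
  } else {
    # days since this customer's previous visit
    visitsPerDay$MBTV[i] <- as.numeric(visitsPerDay$Date[i] - visitsPerDay$Date[i - 1])
  }
}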
I'm sure this is not the most efficient way to do it. Is there a better way to approach this? Thanks!
Best answer
Here is an approach using tapply:
# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date)
visitsPerDay <- transform(visitsPerDay,
                          MBTV = unlist(tapply(Date,
                                               CUST_ID,
                                               FUN = function(x) c(NA, diff(x)))))
The result:
CUST_ID Date MBTV
11 1 2013-09-19 NA
12 1 2013-10-03 14
13 1 2013-10-08 5
14 1 2013-10-12 4
15 1 2013-10-20 8
16 1 2013-10-25 5
17 1 2013-11-01 7
18 1 2013-11-02 1
19 1 2013-11-08 6
110 1 2013-11-15 7
111 1 2013-11-23 8
112 1 2013-12-02 9
113 1 2013-12-04 2
114 1 2013-12-09 5
21 2 2013-09-16 NA
22 2 2013-09-17 1
23 2 2013-09-18 1
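One caveat with the tapply approach: unlist(tapply(...)) returns the values grouped by CUST_ID, so it only lines up with the rows if the data frame is already sorted by customer. If that is not guaranteed, a variant using ave() keeps each value in its original row position. The following is a minimal sketch under that assumption, not part of the original answer; the small data frame `visits` is hypothetical sample data, with dates in chronological order within each customer but the customers interleaved.

```r
# Hypothetical sample data: customers interleaved, dates ascending per customer
visits <- data.frame(
  CUST_ID = c(2, 1, 2, 1, 2, 1),
  Date = as.Date(c("2013-09-16", "2013-09-19", "2013-09-17",
                   "2013-10-03", "2013-09-18", "2013-10-08"))
)

# ave() applies FUN within each CUST_ID group and writes the results back
# to the rows they came from; as.numeric() turns dates into day counts
visits$MBTV <- ave(as.numeric(visits$Date), visits$CUST_ID,
                   FUN = function(x) c(NA, diff(x)))

print(visits)
```

The first visit of each customer gets NA, and every other row gets the number of days since that customer's previous visit.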
Edit: a faster approach:
# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date)
visitsPerDay$MBTV <- c(NA_integer_,
                       "is.na<-"(diff(visitsPerDay$Date),
                                 !duplicated(visitsPerDay$CUST_ID)[-1]))
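The one-liner above is compact but dense. As I read it, it takes the differences over the whole Date column at once and then blanks out every row where a new customer starts. Here is a step-by-step sketch of the same idea on hypothetical sample data (`visits` and `gap` are illustrative names); it assumes the rows are sorted by CUST_ID and, within each customer, by Date.

```r
# Hypothetical sample data, sorted by customer and then by date
visits <- data.frame(
  CUST_ID = c(1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2013-09-19", "2013-10-03", "2013-10-08",
                   "2013-09-16", "2013-09-17", "2013-09-18"))
)

# Step 1: difference between every row and the row before it,
# ignoring customer boundaries for the moment
gap <- c(NA, diff(as.numeric(visits$Date)))

# Step 2: the first row of each customer has no previous visit,
# so the difference that crosses the boundary is replaced by NA
gap[!duplicated(visits$CUST_ID)] <- NA

visits$MBTV <- gap
print(visits)
```

Because everything is done with whole-vector operations and nothing is split by group, this scales well to a few hundred thousand rows.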
Regarding "r - Algorithm efficiency: time-difference loop", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21189073/