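I have a sample dataset of user IDs and the months in which they made transactions. My goal is to calculate, month by month, how many of the original users made transactions. In other words, how many users who were new in January also transacted in February, March, and April? How many users who were new in February transacted in March and April, and so on.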
> data
       date user_id
1  Jan 2017       1
2  Jan 2017       2
3  Jan 2017       3
4  Jan 2017       4
5  Jan 2017       5
6  Feb 2017       1
7  Feb 2017       3
8  Feb 2017       5
9  Feb 2017       7
10 Feb 2017       9
11 Mar 2017       2
12 Mar 2017       4
13 Mar 2017       6
14 Mar 2017       8
15 Mar 2017      10
16 Apr 2017       1
17 Apr 2017       3
18 Apr 2017       6
19 Apr 2017       9
20 Apr 2017      12
The expected output for this dataset looks like this:
> output
    Jan Feb Mar Apr
Jan   5   3   2   2
Feb  NA   2   0   1
Mar  NA  NA   3   1
Apr  NA  NA  NA   1
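So far, the only approach I can think of is to split the dataset and then compute, for each month, the unique IDs that did not appear in any earlier month. This is verbose and does not scale to large datasets with many months.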
subsets <- split(data, data$date, drop = TRUE)
for (i in seq_along(subsets)) {
  assign(paste0("M", i), as.data.frame(subsets[[i]]))
}
M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)
M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))
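Is there a shorter way in R, using dplyr or even base R, to produce the output above? The real dataset spans many years and months. The data format is as follows: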
> sapply(data, class)
     date   user_id
"yearmon" "integer"
And the sample data:
> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
Best answer
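Here is an example using data.table. It assigns each user to the cohort of their first transaction month, then counts the users per cohort and month with dcast: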
library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
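# Duplicate the first row, presumably to show that unique() handles repeated transactions
data <- data[c(1,1:nrow(data)),]
setDT(data)
# cohort = each user's first month; dcast then counts rows per cohort and month
(cohorts <- dcast(unique(data)[,cohort:=min(date),by=user_id],cohort~date))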
# (month labels depend on the locale; "Mrz" is a German abbreviation for March)
#      cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017        5        3        2        2
# 2: Feb 2017        0        2        0        1
# 3: Mrz 2017        0        0        3        1
# 4: Apr 2017        0        0        0        1
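# Convert to a matrix with the cohort labels as row names
m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
# Mask months before each cohort's first month
# (assumes every month has at least one new user, so the matrix is square)
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon")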
m
#           yearmon
# cohort     Jan 2017 Feb 2017 Mrz 2017 Apr 2017
#   Jan 2017        5        3        2        2
#   Feb 2017       NA        2        0        1
#   Mrz 2017       NA       NA        3        1
#   Apr 2017       NA       NA       NA        1
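Since the question also asks about base R, here is a minimal sketch of the same idea without extra packages. It is not part of the original answer. It assumes the months are stored as an ordered factor rather than zoo's "yearmon" class (a simplification made only to keep the sketch dependency-free), and the names months, month_idx and first_idx are made up for illustration.

```r
# Sample data, with months as an ordered factor (assumption: the original
# data uses zoo's "yearmon" class; a factor keeps this sketch self-contained).
months <- c("Jan 2017", "Feb 2017", "Mar 2017", "Apr 2017")
data <- data.frame(
  date = factor(rep(months, each = 5), levels = months),
  user_id = c(1L, 2L, 3L, 4L, 5L, 1L, 3L, 5L, 7L, 9L,
              2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 9L, 12L)
)

# Drop repeated (user, month) rows so each user counts once per month.
data <- unique(data)

# Cohort = the first month in which each user appears.
month_idx <- as.integer(data$date)
first_idx <- ave(month_idx, data$user_id, FUN = min)
data$cohort <- factor(months[first_idx], levels = months)

# Cross-tabulate cohort against transaction month. Because both factors
# share the same levels, the result is square even if some month has no
# new users.
m <- unclass(table(cohort = data$cohort, month = data$date))

# A cohort cannot be active before its first month, so mask those cells.
m[lower.tri(m)] <- NA
m
```

On the sample data this should reproduce the counts from the expected output in the question. With real "yearmon" data, the same steps should work if the factor levels are built from the sorted unique months, though that variant is untested here.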
Regarding "r - Tracking cohorts over time in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45847641/