I have a sequence that looks like this:
  id ep value
1  1  1     a
2  1  2     a
3  1  3     b
4  1  4     d
5  2  1     a
6  2  2     a
7  2  3     c
8  2  4     e
and what I want to do is reduce it to:
  id ep value n time total
1  1  0     a 2   20    40
2  1  1     b 1   10    40
3  1  2     d 1   10    40
4  2  0     a 2   20    40
5  2  1     c 1   10    40
6  2  2     e 1   10    40
This dplyr code seems to work fine:
library(dplyr)
short = df %>% group_by(id) %>%
mutate(grp = cumsum(value != lag(value, default = value[1]))) %>%
count(id, grp, value) %>% mutate(time = n*10) %>% group_by(id) %>%
mutate(total = sum(time))
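However, my data set is really large, and this takes a long time to run.
Question 1
Can anyone help me translate this pipeline into data.table code?
Question 2
I am also interested in going back to the long format, and I would like to know the fastest way to do it. Currently, I am using this code: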
short[rep(1:nrow(short), short$n), ] %>%
select(-n, -time, -total) %>%
group_by(id) %>%
mutate(ep = 1:n())
Any suggestions?
df = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("1", "2"), class = "factor"), ep = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
value = structure(c(1L, 1L, 2L, 4L, 1L, 1L, 3L, 5L), .Label = c("a",
"b", "c", "d", "e"), class = "factor")), .Names = c("id",
"ep", "value"), row.names = c(NA, -8L), class = "data.frame")
Best answer
One option is to use rleid from data.table:
library(data.table)
short1 <- setDT(df)[, .N,.(id, grp = rleid(value), value)
][, time := N*10
][, c('total', 'ep') := .(sum(time), seq_len(.N) - 1), id
][, grp := NULL][]
short1
#    id value N time total ep
# 1:  1     a 2   20    40  0
# 2:  1     b 1   10    40  1
# 3:  1     d 1   10    40  2
# 4:  2     a 2   20    40  0
# 5:  2     c 1   10    40  1
# 6:  2     e 1   10    40  2
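To make the role of rleid clearer, here is a minimal sketch on a toy vector (the vector and the variable names run_id and manual_id are made up for illustration). It suggests that rleid assigns one integer id to each run of identical consecutive values, which matches the cumsum-over-changes idea used in the dplyr version:

```r
library(data.table)

# Toy vector (hypothetical): the values of id 1, plus a repeated "a" at the end
value <- c("a", "a", "b", "d", "a")

# rleid() gives each run of identical consecutive values its own integer id
run_id <- rleid(value)
run_id
# 1 1 2 3 4

# The same ids built by hand: count the positions where the value changes
manual_id <- cumsum(c(TRUE, value[-1] != value[-length(value)]))

# Both approaches should label the runs identically
stopifnot(all(run_id == manual_id))
```

Note that the trailing "a" gets a new id (4) rather than sharing id 1 with the first run, which is why grouping by the run id counts consecutive repeats and not overall frequencies.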
Getting back to the "long" format would be:
short1[rep(seq_len(.N), N), -c('N', 'time', 'total', 'ep'),
with = FALSE][, ep1 := seq_len(.N), id][]
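As a sanity check on the expansion step, the following sketch collapses a small table into runs and expands it again, then confirms the values come back in the original order. The table dt is a stand-in for the question's data using plain character columns instead of factors, and the names short_dt and long_dt are chosen here for illustration only:

```r
library(data.table)

# Small stand-in for the question's data (character columns, not factors)
dt <- data.table(
  id    = rep(c("1", "2"), each = 4),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# Collapse consecutive repeats into runs, counting the length of each run in N
short_dt <- dt[, .N, by = .(id, grp = rleid(value), value)][, grp := NULL][]

# Expand: repeat each run's row N times, keeping only id and value
long_dt <- short_dt[rep(seq_len(.N), N), .(id, value)]

# Rebuild the within-id position counter
long_dt[, ep := seq_len(.N), by = id]

# The expanded values should match the original, row for row
stopifnot(identical(long_dt$value, dt$value))
```

Selecting the wanted columns by name in j, as done here, is one alternative to dropping the unwanted ones with a negated character vector; whether either is faster on large data would need to be measured.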
A direct translation of the dplyr code to data.table would be:
setDT(df)[, grp := cumsum(value != shift(value, fill = value[1])), id
][, .(N= .N), .(id, grp, value)
][, time := N*10
][, c('total', 'ep') := .(sum(time), seq_len(.N) - 1), id
][, grp := NULL][]
#    id value N time total ep
# 1:  1     a 2   20    40  0
# 2:  1     b 1   10    40  1
# 3:  1     d 1   10    40  2
# 4:  2     a 2   20    40  0
# 5:  2     c 1   10    40  1
# 6:  2     e 1   10    40  2
For more on r - counting distinct sequence patterns, see the similar question on Stack Overflow: https://stackoverflow.com/questions/47941391/