I have a sequence that looks like this:

  id ep value
1  1  1     a
2  1  2     a
3  1  3     b
4  1  4     d
5  2  1     a
6  2  2     a
7  2  3     c
8  2  4     e

What I want to do is reduce it to:
      id    ep  value     n  time total
1      1     0      a     2    20    40
2      1     1      b     1    10    40
3      1     2      d     1    10    40
4      2     0      a     2    20    40
5      2     1      c     1    10    40
6      2     2      e     1    10    40
This dplyr code seems to work fine:
short = df %>% group_by(id) %>%
 mutate(grp = cumsum(value != lag(value, default = value[1]))) %>%
  count(id, grp, value) %>% mutate(time = n*10) %>% group_by(id) %>%
   mutate(total = sum(time))

However, my dataset is really large, and this takes a long time.

Question 1

Can anyone help me translate this into data.table code?

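Question 2

I am also interested in going back to the long format afterwards,
and I would like to know which solution is fastest.

At the moment, I am using this: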
short[rep(1:nrow(short), short$n), ] %>%
  select(-n, -time, -total) %>%
  group_by(id) %>%
  mutate(ep = 1:n())

Any suggestions? Here is the data:
df = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("1", "2"), class = "factor"), ep = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
value = structure(c(1L, 1L, 2L, 4L, 1L, 1L, 3L, 5L), .Label = c("a",
"b", "c", "d", "e"), class = "factor")), .Names = c("id",
"ep", "value"), row.names = c(NA, -8L), class = "data.frame")

Best answer

One option is to use rleid from data.table:

library(data.table)
short1 <- setDT(df)[,  .N,.(id, grp = rleid(value), value)
           ][,  time := N*10
            ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
             ][, grp := NULL][]
short1
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
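
To make the role of rleid clearer, here is a minimal sketch on a bare vector holding the same values as the question's value column. It shows that rleid numbers each run of consecutive identical values. The run ids are not reset per id, which is presumably why the answer also groups by id and value.

```r
library(data.table)

# The value column from the question, as a plain character vector
value <- c("a", "a", "b", "d", "a", "a", "c", "e")

# rleid() gives each run of consecutive identical values its own integer id
run_id <- rleid(value)
print(run_id)
# [1] 1 1 2 3 4 4 5 6
```

Counting rows per (id, run id, value) then gives one row per run, which is the short format.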

Getting back to the "long" format would be:
short1[rep(seq_len(.N), N), -c('N', 'time', 'total', 'ep'),
             with = FALSE][, ep1 := seq_len(.N), id][]
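
As an alternative sketch (not the answer's method), the expansion can also be written by repeating value within each id instead of indexing rows. This assumes plain integer and character columns rather than the factors in the question, and the name long1 is only illustrative. Whether it is faster on large data has not been measured here.

```r
library(data.table)

# Same data as in the question, with plain columns instead of factors
df <- data.table(
  id    = rep(1:2, each = 4),
  ep    = rep(1:4, times = 2),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# Short format: one row per run of identical values within each id
short1 <- df[, .N, by = .(id, grp = rleid(value), value)][, grp := NULL][]

# Back to long: repeat each value N times, then renumber the episodes per id
long1 <- short1[, .(value = rep(value, N)), by = id][, ep := seq_len(.N), by = id][]

# The round trip should reproduce the original columns
stopifnot(identical(long1$value, df$value), identical(long1$ep, df$ep))
```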

The direct translation of the dplyr code to data.table would be:
setDT(df)[, grp := cumsum(value != shift(value, fill = value[1])), id
   ][, .(N= .N), .(id, grp, value)
    ][, time := N*10
     ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
       ][, grp := NULL][]
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
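
The two ways of building grp can be compared directly. The sketch below assumes character values instead of factors, and the column names grp_cumsum and grp_rleid are made up for illustration. Within each id, the cumsum/shift counter should match rleid shifted to start at 0.

```r
library(data.table)

df <- data.table(
  id    = rep(1:2, each = 4),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# Run counter as in the dplyr translation: increments whenever the value changes
df[, grp_cumsum := cumsum(value != shift(value, fill = value[1])), by = id]

# The same counter from rleid(), shifted so that it starts at 0 within each id
df[, grp_rleid := rleid(value) - 1L, by = id]

stopifnot(identical(df$grp_cumsum, df$grp_rleid))
```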

This is about "r - counting different sequence patterns"; the original question is on Stack Overflow: https://stackoverflow.com/questions/47941391/
