r - How to restart a sequence when a condition is met

I have found variations of this question, and I know a modulus could be used, but I am struggling to put it all together.

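I have a series of observations by id and seconds. Within each id, when the accumulated increase in seconds reaches 5 or more, I want to restart the count and move on to the next counter value. Can someone help me answer this with dplyr?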

Original df

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))

df
   id val
1   1   2
2   1  10
3   1  12
4   1  15
5   1  17
6   2   2
7   2   4
8   2   7
9   2   8
10  3  12
11  3  15
12  3  20
13  3  25

Desired result

finalResult
   id val reset
1   1   2     1
2   1  10     2
3   1  12     2
4   1  15     3
5   1  17     3
6   2   2     1
7   2   4     1
8   2   7     2
9   2   8     2
10  3  12     1
11  3  15     1
12  3  20     2
13  3  25     3

Edit

Thanks for the replies yesterday, but I ran into some issues with the given solution.

On this data set, the code works only in some instances.

sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
                         dateFormat = structure(c(1479955726, 1479955726, 1483703713,
                                                  1495190809, 1495190809, 1497265079,
                                                  1497265079, 1474023059, 1474023061),
                                                class = c("POSIXct", "POSIXt"),
                                                tzone = "America/Chicago")),
                    .Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
                    class = c("tbl_df", "tbl", "data.frame"))

Solution used:

jj <- sub.df %>%
  group_by(`ID`) %>%
  arrange(`ID`,`dateFormat`)%>%
  mutate(totalTimeInt = difftime(dateFormat,first(dateFormat),units = 'secs'))%>%
  mutate(totalTimeFormat   = as.numeric(totalTimeInt))%>%
  mutate(reset = cumsum(
    Reduce(
      function(x, y)
        if (x + y >= 5) 0
        else x + y,

        diff(totalTimeFormat), init = 0, accumulate = TRUE
    ) == 0
  ))%>%
  mutate(reset_2 = cumsum(
    accumulate(
      diff(totalTimeFormat),
      ~if (.x + .y >= 5) 0 else .x + .y,
      .init = 0
    ) == 0
  ))

Result

# A tibble: 9 x 6
# Groups:   ID [1]
     ID          dateFormat  totalTimeInt totalTimeFormat reset reset_2
  <chr>              <dttm>        <time>           <dbl> <int>   <int>
1     1 2016-09-16 05:50:59        0 secs               0     1       1
2     1 2016-09-16 05:51:01        2 secs               2     1       1
3     1 2016-11-23 20:48:46  5932667 secs         5932667     2       2
4     1 2016-11-23 20:48:46  5932667 secs         5932667     3       3
5     1 2017-01-06 05:55:13  9680654 secs         9680654     4       4
6     1 2017-05-19 05:46:49 21167750 secs        21167750     5       5
7     1 2017-05-19 05:46:49 21167750 secs        21167750     6       6
8     1 2017-06-12 05:57:59 23242020 secs        23242020     7       7
9     1 2017-06-12 05:57:59 23242020 secs        23242020     8       8

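What happens is that the first two observations are correctly counted as instance 1. The third and fourth observations, however, should both be counted as instance 2, because essentially no time passed between them. Instead, the fourth observation is given a new value.

Correct output: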

# A tibble: 9 x 6
# Groups:   ID [1]
     ID          dateFormat  totalTimeInt totalTimeFormat reset reset_2
  <chr>              <dttm>        <time>           <dbl> <int>   <int>
1     1 2016-09-16 05:50:59        0 secs               0     1       1
2     1 2016-09-16 05:51:01        2 secs               2     1       1
3     1 2016-11-23 20:48:46  5932667 secs         5932667     2       2
4     1 2016-11-23 20:48:46  5932667 secs         5932667     2       2
5     1 2017-01-06 05:55:13  9680654 secs         9680654     3       3
6     1 2017-05-19 05:46:49 21167750 secs        21167750     4       4
7     1 2017-05-19 05:46:49 21167750 secs        21167750     4       4
8     1 2017-06-12 05:57:59 23242020 secs        23242020     5       5
9     1 2017-06-12 05:57:59 23242020 secs        23242020     5       5

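Best answer

If you use Reduce with accumulate = TRUE (or purrr::accumulate, if you prefer), you can reset the running sum of differences whenever it is greater than or equal to 5. Calling cumsum on whether that running total is 0 then returns the reset count.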

library(tidyverse)

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))

df %>%
    group_by(id) %>%
    mutate(reset = cumsum(
        Reduce(
            function(x, y) if (x + y >= 5) 0 else x + y,
            diff(val), init = 0, accumulate = TRUE
        ) == 0
    ))
#> # A tibble: 13 x 3
#> # Groups:   id [3]
#>       id   val reset
#>    <dbl> <dbl> <int>
#>  1     1     2     1
#>  2     1    10     2
#>  3     1    12     2
#>  4     1    15     3
#>  5     1    17     3
#>  6     2     2     1
#>  7     2     4     1
#>  8     2     7     2
#>  9     2     8     2
#> 10     3    12     1
#> 11     3    15     1
#> 12     3    20     2
#> 13     3    25     3

Or with purrr::accumulate,

df %>%
    group_by(id) %>%
    mutate(reset = cumsum(
        accumulate(
            diff(val),
            ~if (.x + .y >= 5) 0 else .x + .y,
            .init = 0
        ) == 0
    ))
#> # A tibble: 13 x 3
#> # Groups:   id [3]
#>       id   val reset
#>    <dbl> <dbl> <int>
#>  1     1     2     1
#>  2     1    10     2
#>  3     1    12     2
#>  4     1    15     3
#>  5     1    17     3
#>  6     2     2     1
#>  7     2     4     1
#>  8     2     7     2
#>  9     2     8     2
#> 10     3    12     1
#> 11     3    15     1
#> 12     3    20     2
#> 13     3    25     3
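
To see why both versions work, it can help to look at the intermediate vectors for a single group. The following is a minimal base R sketch for id 1 only; the names running and is_start are illustrative and do not appear in the answer above.

```r
# Values for id 1 from the example data
val <- c(2, 10, 12, 15, 17)

# Running sum of the differences that snaps back to 0 once it reaches 5.
# init = 0 supplies the starting element, so the result has length(val) entries.
running <- Reduce(
  function(x, y) if (x + y >= 5) 0 else x + y,
  diff(val), init = 0, accumulate = TRUE
)
running
#> [1] 0 0 2 0 2

# A 0 marks the first row of the group or a row where the total was reset
is_start <- running == 0

# Counting the markers cumulatively gives the reset column
reset <- cumsum(is_start)
reset
#> [1] 1 2 2 3 3
```

The first 0 comes from init, which is why every group starts at 1.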

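Regarding the edit, the problem is that some of the differences are 0, which is indistinguishable from a reset when the zeros are counted. The simplest solution is to use NA instead of zero as the reset value: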

library(tidyverse)

sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
                         dateFormat = structure(c(1479955726, 1479955726, 1483703713,
                            1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061),
                            class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")),
                    .Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
                    class = c("tbl_df", "tbl", "data.frame"))

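sub.df %>%
    group_by(ID) %>%
    arrange(ID, dateFormat) %>%
    mutate(reset = cumsum(is.na(
               # convert the differences to plain seconds so the units are explicit
               accumulate(as.numeric(diff(dateFormat), units = "secs"),
                          ~{
                              # running total; a previous NA (reset) counts as 0
                              s <- sum(.x, .y, na.rm = TRUE)
                              if (s >= 5) NA else s
                          },
                          .init = NA)
    )))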
#> # A tibble: 9 x 3
#> # Groups:   ID [1]
#>      ID          dateFormat reset
#>   <chr>              <dttm> <int>
#> 1     1 2016-09-16 05:50:59     1
#> 2     1 2016-09-16 05:51:01     1
#> 3     1 2016-11-23 20:48:46     2
#> 4     1 2016-11-23 20:48:46     2
#> 5     1 2017-01-06 05:55:13     3
#> 6     1 2017-05-19 05:46:49     4
#> 7     1 2017-05-19 05:46:49     4
#> 8     1 2017-06-12 05:57:59     5
#> 9     1 2017-06-12 05:57:59     5
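
The difference between the two reset markers can be reproduced on a tiny vector. This is only a sketch in base R with made-up timestamps (secs is hypothetical data, and zero_marker and na_marker are illustrative names), using Reduce in place of purrr::accumulate:

```r
# Hypothetical timestamps in seconds; the last two are identical
secs <- c(0, 2, 10, 10)

# Zero as the reset marker: the genuine zero difference at the end
# is indistinguishable from a reset, so the count is incremented once too often
zero_marker <- Reduce(
  function(x, y) if (x + y >= 5) 0 else x + y,
  diff(secs), init = 0, accumulate = TRUE
)
reset_zero <- cumsum(zero_marker == 0)
reset_zero
#> [1] 1 1 2 3

# NA as the reset marker: only the initial value and real resets are NA,
# so the zero difference stays in the same instance
na_marker <- Reduce(
  function(x, y) {
    s <- sum(x, y, na.rm = TRUE)
    if (s >= 5) NA else s
  },
  diff(secs), init = NA, accumulate = TRUE
)
reset_na <- cumsum(is.na(na_marker))
reset_na
#> [1] 1 1 2 2
```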

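Ultimately this approach has a similar limitation, though: if any of the values is actually NA, the count will be incremented in the same way. A more robust solution is to return a two-element list from each iteration, one element for the total that resets and one for the reset count. This is more work to implement, though: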

sub.df %>%
    group_by(ID) %>%
    arrange(ID, dateFormat) %>%
    mutate(total_reset = accumulate(
        # one list(total, reset) per difference, with differences in seconds
        transpose(list(total = as.numeric(diff(dateFormat), units = "secs"),
                       reset = rep(0, n() - 1))),
        ~{
            s <- .x$total + .y$total
            if (s >= 5) {
                # threshold reached: zero the total and bump the reset count
                tibble(total = 0, reset = .x$reset + 1)
            } else {
                tibble(total = s, reset = .x$reset)
            }
        },
        .init = tibble(total = 0, reset = 1)
    )) %>%
    unnest(total_reset)
#> # A tibble: 9 x 4
#> # Groups:   ID [1]
#>      ID          dateFormat total reset
#>   <chr>              <dttm> <dbl> <dbl>
#> 1     1 2016-09-16 05:50:59     0     1
#> 2     1 2016-09-16 05:51:01     2     1
#> 3     1 2016-11-23 20:48:46     0     2
#> 4     1 2016-11-23 20:48:46     0     2
#> 5     1 2017-01-06 05:55:13     0     3
#> 6     1 2017-05-19 05:46:49     0     4
#> 7     1 2017-05-19 05:46:49     0     4
#> 8     1 2017-06-12 05:57:59     0     5
#> 9     1 2017-06-12 05:57:59     0     5

The total looks a little silly, but if you look at the differences, it is actually correct.
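
The same idea of carrying two pieces of state, the running total and the reset count, can also be sketched with a plain loop rather than accumulate. This is a different technique from the one in the answer, shown only for comparison. The helper name reset_id and its threshold argument are hypothetical, and the sketch assumes the input is already sorted within each group and contains no missing values.

```r
# Hypothetical helper, not part of the original answer.
# x: numeric vector (e.g. seconds), sorted, without NA
# threshold: accumulated increase at which the count restarts
reset_id <- function(x, threshold = 5) {
  n <- length(x)
  out <- integer(n)
  if (n == 0) return(out)
  total <- 0   # accumulated increase since the last reset
  count <- 1L  # current instance number
  out[1] <- count
  for (i in seq_len(n)[-1]) {
    total <- total + (x[i] - x[i - 1])
    if (total >= threshold) {
      total <- 0
      count <- count + 1L
    }
    out[i] <- count
  }
  out
}

# First example: apply per id with base R's ave()
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df$reset <- ave(df$val, df$id, FUN = reset_id)
df$reset
#>  [1] 1 2 2 3 3 1 1 2 2 1 1 2 3

# Second example: the timestamps from the edit, as sorted seconds
secs <- sort(c(1479955726, 1479955726, 1483703713, 1495190809, 1495190809,
               1497265079, 1497265079, 1474023059, 1474023061))
reset_id(secs)
#> [1] 1 1 2 2 3 4 4 5 5
```

Inside a dplyr pipeline the same helper could be called as mutate(reset = reset_id(val)) after group_by(id). Because the counter is stored separately from the total, a zero difference never looks like a reset.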

A similar question to this one can be found on Stack Overflow: https://stackoverflow.com/questions/47680783/