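I've found variations of this question, and I know a modulus could be used, but I'm having trouble putting it all together.
I have a series of observations by ID and seconds. When the cumulative increase in seconds within an id reaches 5 seconds or more, I want to restart the count. Can someone help me answer this with dplyr?
Original df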
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df
id val
1 1 2
2 1 10
3 1 12
4 1 15
5 1 17
6 2 2
7 2 4
8 2 7
9 2 8
10 3 12
11 3 15
12 3 20
13 3 25
Desired result
finalResult
id val reset
1 1 2 1
2 1 10 2
3 1 12 2
4 1 15 3
5 1 17 3
6 2 2 1
7 2 4 1
8 2 7 2
9 2 8 2
10 3 12 1
11 3 15 1
12 3 20 2
13 3 25 3
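Edit
Thanks for the replies yesterday, but I ran into a problem with the given solution.
On this dataset, the code works for only some of the instances.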
sub.df <- structure(list(`ID` = c("1",
"1", "1",
"1", "1",
"1", "1",
"1", "1"
), dateFormat = structure(c(1479955726, 1479955726, 1483703713,
1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061
), class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")), .Names = c("ID",
"dateFormat"), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
Solution used:
jj <- sub.df %>%
group_by(`ID`) %>%
arrange(`ID`,`dateFormat`)%>%
mutate(totalTimeInt = difftime(dateFormat,first(dateFormat),units = 'secs'))%>%
mutate(totalTimeFormat = as.numeric(totalTimeInt))%>%
mutate(reset = cumsum(
Reduce(
function(x, y)
if (x + y >= 5) 0
else x + y,
diff(totalTimeFormat), init = 0, accumulate = TRUE
) == 0
))%>%
mutate(reset_2 = cumsum(
accumulate(
diff(totalTimeFormat),
~if (.x + .y >= 5) 0 else .x + .y,
.init = 0
) == 0
))
Result
# A tibble: 9 x 6
# Groups: ID [1]
ID dateFormat totalTimeInt totalTimeFormat reset reset_2
<chr> <dttm> <time> <dbl> <int> <int>
1 1 2016-09-16 05:50:59 0 secs 0 1 1
2 1 2016-09-16 05:51:01 2 secs 2 1 1
3 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
4 1 2016-11-23 20:48:46 5932667 secs 5932667 3 3
5 1 2017-01-06 05:55:13 9680654 secs 9680654 4 4
6 1 2017-05-19 05:46:49 21167750 secs 21167750 5 5
7 1 2017-05-19 05:46:49 21167750 secs 21167750 6 6
8 1 2017-06-12 05:57:59 23242020 secs 23242020 7 7
9 1 2017-06-12 05:57:59 23242020 secs 23242020 8 8
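What happens is that the first two observations are correctly counted as one instance. The third and fourth observations, however, should both be counted as instance 2, because essentially no time elapsed between them.
Correct output: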
# A tibble: 9 x 6
# Groups: ID [1]
ID dateFormat totalTimeInt totalTimeFormat reset reset_2
<chr> <dttm> <time> <dbl> <int> <int>
1 1 2016-09-16 05:50:59 0 secs 0 1 1
2 1 2016-09-16 05:51:01 2 secs 2 1 1
3 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
4 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
5 1 2017-01-06 05:55:13 9680654 secs 9680654 3 3
6 1 2017-05-19 05:46:49 21167750 secs 21167750 4 4
7 1 2017-05-19 05:46:49 21167750 secs 21167750 4 4
8 1 2017-06-12 05:57:59 23242020 secs 23242020 5 5
9 1 2017-06-12 05:57:59 23242020 secs 23242020 5 5
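Best answer
If you use Reduce with accumulate = TRUE (or purrr::accumulate, if you prefer), you can reset the running difference to 0 whenever it reaches 5 or more. Calling cumsum on whether that running total equals 0 then returns the number of resets so far.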
library(tidyverse)
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df %>%
group_by(id) %>%
mutate(reset = cumsum(
Reduce(
function(x, y) if (x + y >= 5) 0 else x + y,
diff(val), init = 0, accumulate = TRUE
) == 0
))
#> # A tibble: 13 x 3
#> # Groups: id [3]
#> id val reset
#> <dbl> <dbl> <int>
#> 1 1 2 1
#> 2 1 10 2
#> 3 1 12 2
#> 4 1 15 3
#> 5 1 17 3
#> 6 2 2 1
#> 7 2 4 1
#> 8 2 7 2
#> 9 2 8 2
#> 10 3 12 1
#> 11 3 15 1
#> 12 3 20 2
#> 13 3 25 3
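To see why this works, the intermediate steps for a single id can be written out in base R. This is only an illustrative sketch for the id 1 values from the question; the variable names gaps, running and reset are chosen here for illustration and are not part of the answer's code.

```r
# Values for id 1 from the example data
val <- c(2, 10, 12, 15, 17)

# Step 1: gaps between consecutive observations
gaps <- diff(val)                      # 8 2 3 2

# Step 2: running total of the gaps, set back to 0 once it reaches 5 or more
running <- Reduce(
  function(x, y) if (x + y >= 5) 0 else x + y,
  gaps, init = 0, accumulate = TRUE
)                                      # 0 0 2 0 2

# Step 3: every 0 marks the start of a new count, so cumsum() numbers the runs
reset <- cumsum(running == 0)          # 1 2 2 3 3

print(data.frame(val, running, reset))
```

The leading 0 supplied through init is what makes the first row count as reset number 1.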
Or, using purrr::accumulate:
df %>%
group_by(id) %>%
mutate(reset = cumsum(
accumulate(
diff(val),
~if (.x + .y >= 5) 0 else .x + .y,
.init = 0
) == 0
))
#> # A tibble: 13 x 3
#> # Groups: id [3]
#> id val reset
#> <dbl> <dbl> <int>
#> 1 1 2 1
#> 2 1 10 2
#> 3 1 12 2
#> 4 1 15 3
#> 5 1 17 3
#> 6 2 2 1
#> 7 2 4 1
#> 8 2 7 2
#> 9 2 8 2
#> 10 3 12 1
#> 11 3 15 1
#> 12 3 20 2
#> 13 3 25 3
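Regarding the edit, the problem is that some of the differences are 0, which is indistinguishable from the 0 used to mark a reset, so those rows are counted as resets too. The simplest solution is to use NA instead of zero as the reset value:
library(tidyverse)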
sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
dateFormat = structure(c(1479955726, 1479955726, 1483703713,
1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061),
class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")),
.Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame"))
sub.df %>%
group_by(ID) %>%
arrange(ID, dateFormat) %>%
mutate(reset = cumsum(is.na(
accumulate(diff(dateFormat),
~{
s <- sum(.x, .y, na.rm = TRUE);
if (s >= 5) NA else s
},
.init = NA)
)))
#> # A tibble: 9 x 3
#> # Groups: ID [1]
#> ID dateFormat reset
#> <chr> <dttm> <int>
#> 1 1 2016-09-16 05:50:59 1
#> 2 1 2016-09-16 05:51:01 1
#> 3 1 2016-11-23 20:48:46 2
#> 4 1 2016-11-23 20:48:46 2
#> 5 1 2017-01-06 05:55:13 3
#> 6 1 2017-05-19 05:46:49 4
#> 7 1 2017-05-19 05:46:49 4
#> 8 1 2017-06-12 05:57:59 5
#> 9 1 2017-06-12 05:57:59 5
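The difference between the two sentinels can be seen on a tiny made-up vector that contains a repeated value. The following base R sketch assumes hypothetical inputs (vals, step_zero, step_na are illustrative names, not from the answer) and uses Reduce in place of purrr::accumulate.

```r
# Made-up seconds: the last two observations are identical, so their gap is 0
vals <- c(0, 2, 10, 10)
gaps <- diff(vals)                     # 2 8 0

# Zero as the reset marker: a genuine gap of 0 right after a reset looks like a reset
step_zero <- function(x, y) if (x + y >= 5) 0 else x + y
reset_zero <- cumsum(
  Reduce(step_zero, gaps, init = 0, accumulate = TRUE) == 0
)

# NA as the reset marker: a genuine 0 stays 0 and is no longer mistaken for a reset
step_na <- function(x, y) {
  s <- sum(x, y, na.rm = TRUE)
  if (s >= 5) NA else s
}
reset_na <- cumsum(
  is.na(Reduce(step_na, gaps, init = NA, accumulate = TRUE))
)

print(reset_zero)                      # 1 1 2 3  (last row wrongly incremented)
print(reset_na)                        # 1 1 2 2
```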
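Ultimately, this approach has a limitation of its own: if any value is genuinely NA, the count will be incremented in the same way. A more robust solution is to return a two-element list from each iteration, one element for the running total since the last reset and one for the reset count. It is more work to implement, though:
sub.df %>%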
group_by(ID) %>%
arrange(ID, dateFormat) %>%
mutate(total_reset = accumulate(
transpose(list(total = diff(dateFormat), reset = rep(0, n() - 1))),
~{
s <- .x$total + .y$total;
if (s >= 5) {
tibble(total = 0, reset = .x$reset + 1)
} else {
tibble(total = s, reset = .x$reset)
}
},
.init = tibble(total = 0, reset = 1)
)) %>%
unnest(total_reset)
#> # A tibble: 9 x 4
#> # Groups: ID [1]
#> ID dateFormat total reset
#> <chr> <dttm> <dbl> <dbl>
#> 1 1 2016-09-16 05:50:59 0 1
#> 2 1 2016-09-16 05:51:01 2 1
#> 3 1 2016-11-23 20:48:46 0 2
#> 4 1 2016-11-23 20:48:46 0 2
#> 5 1 2017-01-06 05:55:13 0 3
#> 6 1 2017-05-19 05:46:49 0 4
#> 7 1 2017-05-19 05:46:49 0 4
#> 8 1 2017-06-12 05:57:59 0 5
#> 9 1 2017-06-12 05:57:59 0 5
The totals look a bit silly, but if you look at the differences, they are actually correct.
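The same idea of carrying the running total and the reset count as separate pieces of state can also be sketched as a plain loop, which avoids sentinel values entirely. The helper reset_counter below is hypothetical (it is not part of dplyr or purrr) and assumes the input is already sorted within each group. It calls as.numeric() on the input so that POSIXct timestamps are treated as seconds, whatever units difftime would otherwise pick.

```r
# Hypothetical helper: number the runs, starting a new run once the
# accumulated gap since the last reset reaches `threshold`.
reset_counter <- function(x, threshold = 5) {
  out <- integer(length(x))
  if (length(x) == 0) return(out)
  gaps <- diff(as.numeric(x))   # POSIXct becomes seconds since the epoch
  out[1] <- 1L
  running <- 0                  # total accumulated since the last reset
  for (i in seq_along(gaps)) {
    running <- running + gaps[i]
    if (running >= threshold) {
      running <- 0
      out[i + 1] <- out[i] + 1L # start a new run
    } else {
      out[i + 1] <- out[i]      # stay in the current run
    }
  }
  out
}

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))

# Apply per id with base R's ave()
df$reset <- ave(df$val, df$id, FUN = reset_counter)
print(df)
```

Inside a dplyr pipeline the same helper could be used as group_by(id) %>% mutate(reset = reset_counter(val)); the loop trades the compactness of accumulate for state that is easier to inspect.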
Source: the original question on Stack Overflow, https://stackoverflow.com/questions/47680783/