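I want to compute a sum over a sliding window on grouped data.
Because I would like to stick to established functions as far as possible, I started with rollapplyr as follows: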
library(tidyverse)
library(reshape2)
library(zoo)
data = data.frame(Count=seq(1,10,1),
                  group=c("A","B","A","A","B","B","B","B","A","A"))
window_size <- 3
data_rolling <- data %>%
  arrange(group) %>%
  group_by(group) %>%
  mutate(Rolling_Count = rollapplyr(Count, width=window_size, FUN=sum, fill = NA)) %>%
  ungroup()
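For the first entries of each group, where fewer values than the width (3 in this case) are available, it fills in NA as documented. What I actually want is the sum of whatever data is available, like this: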
Count  group  Rolling_Count  expected_Result
    1  A                 NA                1
    3  A                 NA                4
    4  A                  8                8
    9  A                 16               16
   10  A                 23               23
    2  B                 NA                2
    5  B                 NA                7
    6  B                 13               13
    7  B                 18               18
    8  B                 21               21
I know I can replace width=window_size with c(rep(1:window_size,1),rep(window_size:window_size,(n()-window_size))) to get what I want, but this is really slow. That approach also assumes that n() is greater than window_size.
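The width-vector workaround can be written with pmin() so that it no longer depends on n() exceeding window_size. The following is only a minimal sketch that reuses the sample data and the rollapplyr call from above. It relies on rollapplyr accepting a per-row width vector, and it does not address the speed concern.

```r
library(dplyr)
library(zoo)

data <- data.frame(
  Count = seq(1, 10, 1),
  group = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "A")
)
window_size <- 3

data_rolling <- data %>%
  arrange(group) %>%
  group_by(group) %>%
  # Per-row window width within each group: 1, 2, ..., capped at window_size.
  # pmin() also works when a group has fewer rows than window_size.
  mutate(Rolling_Count = rollapplyr(Count,
                                    width = pmin(row_number(), window_size),
                                    FUN = sum,
                                    fill = NA)) %>%
  ungroup()

data_rolling
```

Because the width never exceeds the number of rows seen so far, every window fits and no NA should remain in Rolling_Count.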
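So: is there already an R / zoo function that handles grouped data as described above, copes with groups that have fewer than window_size entries, and is faster than the approach above?
Thanks for any hints!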
Best answer
A solution based on data.table and RcppRoll should perform better.
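It is not as clean as I would like. RcppRoll::roll_sum() actually has a partial argument that has not been implemented yet. In theory it would solve this problem cleanly, but it does not look like it will be available any time soon; see GH Issue #18.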
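In any case, until someone implements a rolling sum in R that allows what you need here, filling in the first n - 1 rows of each group with cumsum seems like a sensible solution.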
library(data.table)
library(RcppRoll)
data = data.frame(Count=seq(1,10,1),
group=c("A","B","A","A","B","B","B","B","A","A"))
## Convert to a `data.table` by reference
setDT(data)
window_size <- 3
## Add a per-group row counter column so that we can go back
## and fill in rows 1 & 2 of each group
data[,Group_RowNumber := seq_len(.N), keyby = .(group)]
## Do a rolling window -- this won't fill in the first 2 rows
data[,Rolling_Count := RcppRoll::roll_sum(Count,
                                          n = window_size,
                                          align = "right",
                                          fill = NA), keyby = .(group)]
## Go back and fill in the ones we missed
data[Group_RowNumber < window_size, Rolling_Count := cumsum(Count), by = .(group)]
data
# Count group Group_RowNumber Rolling_Count
# 1: 1 A 1 1
# 2: 3 A 2 4
# 3: 4 A 3 8
# 4: 9 A 4 16
# 5: 10 A 5 23
# 6: 2 B 1 2
# 7: 5 B 2 7
# 8: 6 B 3 13
# 9: 7 B 4 18
# 10: 8 B 5 21
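
For comparison, zoo's rollapply family has its own partial argument, which shrinks the window at the edges instead of returning NA. The sketch below applies it per group on the same sample data; the name dt is only illustrative. It is meant as a cross-check of the expected result, not as a claim about speed relative to the RcppRoll approach above.

```r
library(data.table)
library(zoo)

dt <- data.table(
  Count = seq(1, 10, 1),
  group = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "A")
)
window_size <- 3

## partial = TRUE lets the first rows of each group use a shorter window,
## so no separate fill-in step is needed
dt[, Rolling_Count := rollapplyr(Count,
                                 width = window_size,
                                 FUN = sum,
                                 partial = TRUE),
   by = .(group)]

## Sort by group for display; row order within each group is preserved
result <- dt[order(group)]
result
```

On the sample data this should reproduce the Rolling_Count column shown above.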