Rolling/moving average by group


Problem description

How can I generate a rolling mean with grouped data? Here is the data:

set.seed(31)
dd<-matrix(sample(seq(1:20),30,replace=TRUE),ncol=3)

Add a group identifier, then sort by it:

du<-sample(seq(1:4),10,replace=TRUE)
d<-cbind(du,dd)
d<-d[order(d[,1]),]

This gives the rolling mean but ignores group boundaries:

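library(zoo)  # provides zoo() and rollapply()

# Right-aligned rolling mean of width 3 over each value column;
# partial=TRUE averages the shorter windows at the start.
d_roll_mean <- apply(d[,2:4], 2,
                   function(x) {
                     rollapply(zoo(x), 3, mean, partial=TRUE, align='right')
                   }
)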

This gives the following result:

# cbind(d,d_roll_mean)
# [1,]  1  3  3 12  3.000000  3.000000 12.000000
# [2,]  2 10 13  8  6.500000  8.000000 10.000000
# [3,]  2 17  2 17 10.000000  6.000000 12.333333
# [4,]  3 14  6  3 13.666667  7.000000  9.333333
# [5,]  3  6 20  1 12.333333  9.333333  7.000000
# [6,]  3  1 16 19  7.000000 14.000000  7.666667
# [7,]  3 19  2 11  8.666667 12.666667 10.333333
# [8,]  4 12  1  9 10.666667  6.333333 13.000000
# [9,]  4 10 13 12 13.666667  5.333333 10.666667
# [10,]  4  8 20  7 10.000000 11.333333  9.333333

Here is the goal: a rolling mean that restarts at each group boundary:

# Desired
# [1,]  1  3  3 12  3.000000  3.000000 12.000000
# [2,]  2 10 13  8 10.000000 13.000000  8.000000
# [3,]  2 17  2 17 13.500000  7.500000 12.500000
# [4,]  3 14  6  3 14.000000  6.000000  3.000000
# [5,]  3  6 20  1 10.000000 13.000000  2.000000
# [6,]  3  1 16 19  7.000000 14.000000  7.666667
# [7,]  3 19  2 11  8.666667 12.666667 10.333333
# [8,]  4 12  1  9 12.000000  1.000000  9.000000
# [9,]  4 10 13 12 11.000000  7.000000 10.500000
# [10,]  4  8 20  7 10.000000 11.333333  9.333333
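
One way to get a group-aware result in base R is to run a small rolling-mean helper inside each group with ave(). The sketch below hard-codes the ten rows printed above. The helper roll3 and the column names v1 to v3 are illustrative names, not part of the original code, and roll3 is a stand-in for the rollapply call rather than the same function.

```r
# The ten rows printed above: group id followed by three value columns.
d <- cbind(
  du = c(1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  v1 = c(3, 10, 17, 14, 6, 1, 19, 12, 10, 8),
  v2 = c(3, 13, 2, 6, 20, 16, 2, 1, 13, 20),
  v3 = c(12, 8, 17, 3, 1, 19, 11, 9, 12, 7)
)

# Hypothetical helper: right-aligned mean over up to 3 values,
# using shorter windows at the start (similar to partial=TRUE).
roll3 <- function(y) {
  sapply(seq_along(y), function(i) mean(y[max(1, i - 2):i]))
}

# ave() runs roll3 separately within each group and keeps the row order.
grouped <- apply(d[, -1], 2, function(col) ave(col, d[, "du"], FUN = roll3))

cbind(d, grouped)
```

Because ave() returns values in the original row order, this version does not require d to be sorted by group.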

This is close, but it produces a list with one element per group instead of a matrix:

doApply <- function(x) {
  apply(x, 2,
        function(y) {
          rollapply(zoo(y), 3, mean, partial=TRUE, align='right')
        })
}

d2_roll_mean <- by(d[,2:4], d[,1], doApply)

There are some answers to the question. Here is how they compare in execution time:

set.seed(31)

nrow=20000
ncol=600
nun=350
nValues = 20
dd<-matrix(sample(seq(1:nValues),nrow*ncol,replace=TRUE),ncol=ncol)
du<-sample(seq(1:nun),nrow,replace=TRUE)
d<-cbind(du,dd)
d<-d[order(d[,1]),]
library(zoo)
doApply <- function(x) {
  apply(x, 2,
        function(y) {
          rollapply(zoo(y), 3, mean, partial=TRUE, align='right')
        })
}
library(data.table)
library(caTools)

fun1<-function(d) {by(d[,-1], d[,1], doApply)}
fun2<- function(d){
  DT <- data.table(d, key='du')
  DT[, lapply(.SD, function(y)
    runmean(y, 3, alg='fast',align='right')), by=du]
}

system.time(d2_roll_mean <- fun1(d))
system.time(d2_roll_mean2 <- fun2(d))

The timing indicates that the data.table approach (fun2) is about 10 times faster than the by()/rollapply approach (fun1).

          user   system  elapsed
fun1  1048.910    0.378 1049.158
fun2   107.296    0.097  107.392

all.equal() does not report equality, but by inspection the results seem to be the same...

d2a<-do.call(rbind,d2_roll_mean)   # stack the per-group results from by()
d2b<-cbind(1,d2a)                  # constant 1 is only a placeholder for the 'du' column
d2c<-data.table(d2b)
setnames(d2c,names(d2c),names(d2_roll_mean2))

all.equal(d2c,d2_roll_mean2)

The output of all.equal() is:

[1] "Attributes: < Length mismatch: comparison on first 1 components >"
[2] "Component "du": Mean relative difference: 175.6631"

When the above approach was applied to the real data, the following error was generated:

Error in `[<-`(`*tmp*`, (k2 + 1):n, , value = 2) :
  subscript out of bounds

This error occurred because some groups had too few rows. Those groups were removed, and the process then worked. Ref: How to drop factors that have fewer than n members
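
Dropping small groups can be sketched in base R as follows. The threshold min_rows = 3 is an assumption, chosen to match the window width used above; the original post does not say which cutoff was applied. The sample rows and the name d_kept are illustrative.

```r
# Small sample with the group id in the first column (illustrative data).
d <- cbind(
  du = c(1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  v1 = c(3, 10, 17, 14, 6, 1, 19, 12, 10, 8)
)

min_rows <- 3  # assumed threshold: at least as many rows as the window width

# Size of the group each row belongs to, aligned with the rows of d.
group_size <- ave(d[, "du"], d[, "du"], FUN = length)

# Keep only rows from groups that are large enough; drop=FALSE keeps a matrix.
d_kept <- d[group_size >= min_rows, , drop = FALSE]
d_kept
```

Here groups 1 and 2 are removed because they have fewer than three rows, while groups 3 and 4 are kept.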

Recommended answer

The only thing missing is a do.call(rbind,d2_roll_mean) to stack the per-group results into one matrix. To add the original data:

cbind(d,do.call(rbind,d2_roll_mean))
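
A self-contained sketch of this step on ten sample rows is shown below. To avoid package dependencies, the hypothetical helper roll3 stands in for the rollapply call; the column names are also illustrative.

```r
# Sample rows, already sorted by the group id in the first column.
d <- cbind(
  du = c(1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  v1 = c(3, 10, 17, 14, 6, 1, 19, 12, 10, 8),
  v2 = c(3, 13, 2, 6, 20, 16, 2, 1, 13, 20),
  v3 = c(12, 8, 17, 3, 1, 19, 11, 9, 12, 7)
)

# Stand-in for rollapply(..., 3, mean, partial=TRUE, align='right').
roll3 <- function(y) {
  sapply(seq_along(y), function(i) mean(y[max(1, i - 2):i]))
}
doApply <- function(x) apply(x, 2, roll3)

res <- by(d[, -1], d[, 1], doApply)  # list-like object, one element per group
stacked <- do.call(rbind, res)       # stack the groups back into one matrix
combined <- cbind(d, stacked)        # original columns next to the rolling means
combined
```

The stacked rows line up with d only because d was sorted by group beforehand: by() returns the groups in sorted order, so an unsorted d would be misaligned after cbind().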

I ran this through system.time() on a bigger example, and it does take a long time:

set.seed(31)
dd <- matrix(sample(seq(1:20),20000*500,replace=TRUE),ncol=500)
du <- sample(seq(1:350),20000,replace=TRUE)
d <- cbind(du,dd)
d <- d[order(d[,1]),]

system.time(d2_roll_mean <- by(d[,-1], d[,1], doApply))
       User      System      elapsed
     399.60        0.57       409.91

by() and apply() are not the fastest functions. It may actually be faster to walk through the columns with a for loop and compute the means by brute force, relying on the fact that d is sorted by ID.
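
A minimal sketch of that brute-force idea, assuming a right-aligned window of width k with partial windows at the start of each group. The function name grouped_roll_mean_sorted is hypothetical, and it has not been timed against the approaches above.

```r
# Rolling mean per group for a matrix whose first column is a sorted group id.
grouped_roll_mean_sorted <- function(d, k = 3) {
  g <- d[, 1]
  stopifnot(!is.unsorted(g))        # relies on rows being sorted by group
  pos <- sequence(rle(g)$lengths)   # position of each row within its group
  w <- pmin(pos, k)                 # window length, clipped at the group start
  idx <- seq_along(g)
  out <- matrix(NA_real_, nrow(d), ncol(d) - 1)
  for (j in seq_len(ncol(d) - 1)) {
    cs <- c(0, cumsum(d[, j + 1]))  # cs[m + 1] is the sum of the first m values
    out[, j] <- (cs[idx + 1] - cs[idx + 1 - w]) / w
  }
  out
}

# Illustrative sample rows, sorted by group id.
d <- cbind(
  du = c(1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  v1 = c(3, 10, 17, 14, 6, 1, 19, 12, 10, 8),
  v2 = c(3, 13, 2, 6, 20, 16, 2, 1, 13, 20),
  v3 = c(12, 8, 17, 3, 1, 19, 11, 9, 12, 7)
)
out <- grouped_roll_mean_sorted(d)
cbind(d, out)
```

The loop runs once per column and each pass is vectorized with cumsum(), so no per-group function calls are needed. Groups with fewer than k rows are handled by the clipped window length.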
