Computing differences between consecutive columns within column groups in a data.table

Problem description

My data is structured as follows:

library(data.table)

DT <- data.table(Id=c(1,2,3,4,5), Va1=c(3,13,NA,NA,NA), Va2=c(4,40,NA,NA,4), Va3=c(5,34,NA,7,84),
                 Va4=c(2,23,NA,63,9), Vb1=c(8,45,1,7,0), Vb2=c(0,35,0,7,6), Vb3=c(63,0,0,0,5), Vc1=c(2,5,0,0,4))
> DT
   Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1
1:  1   3   4   5   2   8   0  63   2
2:  2  13  40  34  23  45  35   0   5
3:  3  NA  NA  NA  NA   1   0   0   0
4:  4  NA  NA   7  63   7   7   0   0
5:  5  NA   4  84   9   0   6   5   4

Additionally, I have a reference list that maps each column group to its column indices:

reference <- list(g.1=c(2,3,4,5), g.2=c(6,7,8), g.3=c(9))

Columns 2,3,4,5 (variables Va1, Va2, Va3, and Va4) belong to one group of variables. Columns 6,7,8 (variables Vb1, Vb2, Vb3) belong to a second group. Column 9 (variable Vc1) belongs to a third group.

What I need to do is calculate the difference between consecutive columns within column groups.

That is, I need the difference between Va2 and Va1, between Va3 and Va2, and so on, but not between Vb1 and Va4, because those two columns belong to different groups.
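
To make the pairing rule concrete, here is a minimal base R sketch that lists the adjacent column pairs each group produces. The helper name adjacent_pairs is hypothetical (it is not part of data.table), and the column names are copied from the example table above.

```r
# Column names from the example table (position 1 is Id).
cols <- c("Id", "Va1", "Va2", "Va3", "Va4", "Vb1", "Vb2", "Vb3", "Vc1")
reference <- list(g.1 = c(2, 3, 4, 5), g.2 = c(6, 7, 8), g.3 = c(9))

# adjacent_pairs is a hypothetical helper for illustration only.
adjacent_pairs <- function(idx, cols) {
  if (length(idx) < 2) return(NULL)        # a single-column group has no pairs
  data.frame(left  = cols[head(idx, -1)],  # every column except the last
             right = cols[tail(idx, -1)],  # every column except the first
             stringsAsFactors = FALSE)
}

# Build the pairs per group, drop empty groups, and stack the rest.
per_group <- lapply(reference, adjacent_pairs, cols = cols)
pairs <- do.call(rbind, unname(Filter(Negate(is.null), per_group)))
print(pairs)
```

Each row names one difference to compute (right minus left). No row crosses a group boundary, so Va4 and Vb1 are never paired, and the single-column group g.3 contributes nothing.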

The expected output is:

   Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1 D[Va1:Va2] D[Va2:Va3] D[Va3:Va4] D[Vb1:Vb2] D[Vb2:Vb3]
1:  1   3   4   5   2   8   0  63   2          1          1         -3         -8         63
2:  2  13  40  34  23  45  35   0   5         27         -6        -11        -10        -35
3:  3  NA  NA  NA  NA   1   0   0   0         NA         NA         NA         -1          0
4:  4  NA  NA   7  63   7   7   0   0         NA         NA         56          0         -7
5:  5  NA   4  84   9   0   6   5   4         NA         80        -75          6         -1

Currently I am using the following loop:

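  # The last group is skipped; in this example it holds a single column.
  for(i in 1:(length(reference)-1)){
    tmp <- as.list(reference[[i]])
    tmp <- tmp[-length(tmp)]   # drop the last index: it has no right-hand neighbour
    # build index pairs (x+1, x) for each remaining column
    tmp <- mapply(c, lapply(tmp, FUN = function(x) x+1), tmp, SIMPLIFY=FALSE)
    for(j in 1:length(tmp)){
      DT <- cbind(DT, delta = DT[, tmp[[j]][1], with = FALSE] - DT[, tmp[[j]][2], with = FALSE])
    }
  }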

However, my real data.table has 300-500 columns and more than 1,000,000 rows.

How can I make this more efficient?

Recommended answer

I think your loop is fine, except you should use := instead of cbind to add columns:

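# Translate the column indices in each group into column names
ref <- lapply(reference,function(x) names(DT)[x])

for (g in ref){
    if (length(g)==1) next          # a single-column group has no differences
    gx   = tail(g,-1)               # right-hand column of each pair
    gy   = head(g,-1)               # left-hand column of each pair
    gn   = paste0("D[",gy,":",gx,"]")
    # add all difference columns for this group by reference
    DT[,(gn) := mapply(function(x,y) .SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}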
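
As a possible variation on the same idea, the sketch below adds each difference column with data.table's set() function, which also assigns by reference and is intended for use inside loops. This is not the answer's method, just an alternative under the same assumptions (the example DT and reference from the question). I have not benchmarked it against the := version, so treat any speed benefit as something to measure on your own data.

```r
library(data.table)

# Example data and group definitions from the question.
DT <- data.table(Id = c(1, 2, 3, 4, 5),
                 Va1 = c(3, 13, NA, NA, NA), Va2 = c(4, 40, NA, NA, 4),
                 Va3 = c(5, 34, NA, 7, 84),  Va4 = c(2, 23, NA, 63, 9),
                 Vb1 = c(8, 45, 1, 7, 0),    Vb2 = c(0, 35, 0, 7, 6),
                 Vb3 = c(63, 0, 0, 0, 5),    Vc1 = c(2, 5, 0, 0, 4))
reference <- list(g.1 = c(2, 3, 4, 5), g.2 = c(6, 7, 8), g.3 = c(9))

# Resolve indices to names up front, so columns added later cannot shift them.
ref <- lapply(reference, function(x) names(DT)[x])

for (g in ref) {
  if (length(g) < 2) next            # nothing to subtract in a one-column group
  gx <- tail(g, -1)                  # right-hand column of each pair
  gy <- head(g, -1)                  # left-hand column of each pair
  for (k in seq_along(gx)) {
    # set() creates the new column by reference, without copying DT
    set(DT, j = paste0("D[", gy[k], ":", gx[k], "]"),
        value = DT[[gx[k]]] - DT[[gy[k]]])
  }
}

print(DT)
```

Both versions modify DT in place, so unlike the cbind approach the table is not copied each time a column is added.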

08-13 17:55