Row operations in data.table using `by = .I`

Question

Here is a good SO explanation about row operations in data.table

One alternative that came to my mind is to use a unique id for each row and then apply a function using the by argument. Like this:

library(data.table)

dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)],
                 V1=1:5,
                 V2=3:7,
                 V3=5:1)

# create a column with row positions
dt[, rowpos := .I]

# calculate standard deviation by row
dt[ ,  sdd := sd(.SD[, -1, with=FALSE]), by = rowpos ]




  1. Is there a good reason not to use this approach? Are there more efficient alternatives?

  2. Why does using by = .I not work the same way?

dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = .I ]


Recommended answer

1) Well, one reason not to use it, at least for the rowSums example, is performance, along with the creation of an unnecessary column. Compare it to option f2 below, which is about 3.4x faster at the median and does not need the rowpos column:

dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)], V1=1:5, V2=3:7, V3=5:1)
f1 <- function(dt){
  dt[, rowpos := .I]
  dt[ ,  sdd := rowSums(.SD[, 2:4, with=FALSE]), by = rowpos ] }
f2 <- function(dt){dt[, sdd := rowSums(dt[, 2:4, with=FALSE])]}

library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
#   expr      min       lq     mean   median       uq      max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a
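
A related variant, offered only as a sketch and not part of the benchmark above: the columns to sum can be selected by name through .SDcols rather than by positions 2:4, which likewise avoids the rowpos column. The column names V1 to V3 come from the example table; the variable name num_cols is made up for illustration.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# hypothetical name: the columns to sum across
num_cols <- c("V1", "V2", "V3")

# rowSums() is already vectorised over rows, so no by = is needed;
# .SDcols restricts .SD to the numeric columns
dt[, sdd := rowSums(.SD), .SDcols = num_cols]
```

Selecting by name keeps the call correct if columns are later reordered, which positional indexing does not guarantee.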

2) On your second question: although dt[, sdd := sum(.SD[, 2:4, with=FALSE]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4, with=FALSE]), by = 1:NROW(dt)] works perfectly. Since ?data.table says that ".I is an integer vector equal to seq_len(nrow(x))", one might expect the two to be equivalent. The difference, however, is that .I is meant for use in j, not in by, because its value is returned by by rather than evaluated beforehand.
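
To make the "for use in j" point concrete, here is a minimal sketch of how .I behaves inside j, using the same example table. The result names all_rows and first_rows are made up for illustration.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# in j with no grouping, .I is simply the row numbers of dt
all_rows <- dt[, .I]

# with by = V0, .I holds the row numbers that belong to each group,
# so .I[1] is the position of the first row of that group
first_rows <- dt[, .(first_row = .I[1]), by = V0]
```

In both calls .I is filled in by data.table while it evaluates j, which is why it has a meaningful value there.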

It might also be expected (see comment on question above from @eddi) that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note, the same applies to .SD, .EACHI, .N, .GRP, and .BY)

.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL

The upshot is that by = .I behaves the same as by = NULL.
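
The practical difference between a single group (no by, which is the same as by = NULL) and one group per row can be sketched as follows. This is an illustration only: it uses seq_len(nrow(dt)) as the per-row grouping, and the names num_cols, total and per_row are made up here.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
num_cols <- c("V1", "V2", "V3")  # hypothetical name for the numeric columns

# no grouping: j sees all rows at once, so this is one grand total
total <- dt[, sum(unlist(.SD)), .SDcols = num_cols]

# one group per row: j is evaluated once for every row
per_row <- dt[, .(s = sum(unlist(.SD))),
              by = .(row = seq_len(nrow(dt))),
              .SDcols = num_cols]
```

Here total is a single number covering the whole table, while per_row has one sum for each row.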

3) We have already seen in part 1 that for rowSums, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?

Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here, and demonstrates that the loop version is faster than either of the by = approaches:

f.rowpos <- function(){
  dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
  dt[, rowpos := .I]
  dt[ ,  sdd := sum(.SD[, 2:4, with=FALSE]), by = rowpos ][]
}

f.nrow <- function(){
  dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
  dt[, sdd := sum(.SD[, 2:4, with=FALSE]), by = 1:NROW(dt) ][]
}

f.forset<- function(){
  dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
  dt[, sdd:=0L]
  for (i in 1L:NROW(dt)) {
    set(dt, i, 5L, sum(dt[i, 2:4]))
  }
  dt
}

microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: seconds
#        expr      min       lq     mean   median       uq      max neval cld
#  f.rowpos() 4.465371 4.503614 4.510916 4.505922 4.521629 4.558042     5   b
#    f.nrow() 4.499120 4.499920 4.541131 4.558701 4.571267 4.576647     5   b
#  f.forset() 2.540556 2.603505 2.654036 2.606108 2.750719 2.769292     5  a

So, in conclusion, even in situations where there is not an optimised function such as rowSums that already operates by row, there are always alternatives to using a rowpos column that are faster, while not requiring creation of a redundant column.
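
For the standard deviation from the original question, one further option that needs no rowpos column is apply() over the selected columns. This is a sketch only: it was not part of the benchmarks above and no timing claim is made for it. The name num_cols is made up for illustration.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
num_cols <- c("V1", "V2", "V3")  # hypothetical name for the numeric columns

# apply() converts .SD to a matrix and calls sd() once per row (MARGIN = 1)
dt[, sdd := apply(.SD, 1, sd), .SDcols = num_cols]
```

Because apply() works on a matrix, this suits columns of a single type, as is the case for V1 to V3 here.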
