Growing a data.frame in a memory-efficient manner

Question

According to "Creating an R dataframe row-by-row", it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R so that I end up with a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.

Recommended answer

First approach

I tried assigning to each element of a pre-allocated data.frame:

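# Pre-allocate a 1000-row data.frame, then fill it one cell at a time
res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)  # print a message every time res is duplicated
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}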

But tracemem goes crazy (i.e. the data.frame is being copied to a new address on each assignment).
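
Since the question allows any intermediate format, one option worth sketching is to pre-allocate a plain atomic vector per column, fill those in the loop, and build the data.frame once at the end. This is only a minimal sketch under that assumption, not something benchmarked here; the vector names x and y simply mirror the columns used above.

```r
n <- 1000

# Pre-allocate one numeric vector per column (names chosen to mirror the example above)
x <- numeric(n)
y <- numeric(n)

for (i in seq_len(n)) {
  x[i] <- runif(1)
  y[i] <- rnorm(1)
}

# Assemble the data.frame a single time, after the loop
res <- data.frame(x = x, y = y)
```

Assigning into a plain vector that has a single reference can usually be done by R in place, so this should avoid duplicating the whole structure on every iteration, though that behaviour can depend on the R version and the environment the code runs in.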

Alternative approach (doesn't work either)

One approach (not sure it's faster, as I haven't benchmarked it yet) is to create a list of data.frames, then stack them all together:

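library(taRifx)  # provides the stack() method for lists used below
# Build a single-row data.frame
makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
res.df <- stack(res)  # combine the list into one data.frame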

Unfortunately, in creating the list I think you will be hard-pressed to pre-allocate. For instance:

> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:

In other words, replacing an element of the list causes the list to be copied. I assume it's the whole list, but it's possible that only that element is copied. I'm not intimately familiar with the details of R's memory management.
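
For readers without taRifx installed, the same collect-then-combine idea can be sketched in base R. This is an illustration only: it assumes a list pre-sized with vector("list", n) and a single do.call(rbind, ...) at the end, neither of which appears in the original answer, and it has not been benchmarked here.

```r
n <- 1000

# Pre-size the list so it does not have to grow inside the loop
rows <- vector("list", n)

for (i in seq_len(n)) {
  # Each element is a one-row data.frame
  rows[[i]] <- data.frame(x = runif(1), y = rnorm(1))
}

# Combine all rows in a single call instead of calling rbind once per row
res.df <- do.call(rbind, rows)
```

This still creates many small data.frames, so it should not be expected to beat the data.table approach described below.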

Probably the best approach

As with many speed- or memory-limited processes these days, the best approach may well be to use a data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:

library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i,x := runif(1)]
  dt[i,y := rnorm(1)]
}
# note no message from tracemem

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

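library(data.table)
library(microbenchmark)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))

# Update one cell at a time with := (assign by reference via [.data.table)
dt.colon <- function(dt) {
  for(i in seq_len(n)) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

# Update one cell at a time with set(), which skips the overhead of [.data.table
dt.set <- function(dt) {
  for(i in seq_len(n)) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

# Time both versions; each is run twice
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)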

(Results shown below)

Benchmarking

With the loop run 10,000 times, data.table is almost a full order of magnitude faster than the data.frame approach, and the stacked-list approach is the slowest of the three:

Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

And comparison of := with set():

> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

Note that n here is 10^6, larger than in the benchmarks above. So there's at least an order of magnitude more work, and the result is measured in milliseconds rather than seconds. Impressive indeed.
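
To see the by-reference behaviour of set() directly, here is a minimal sketch that records the table's memory address before and after the loop using data.table's address() function. The smaller n and the variable names before and after are assumptions made for illustration; they are not part of the original answer.

```r
library(data.table)

n <- 1000L
dt <- data.table(x = rep(0, n), y = rep(0, n))

before <- address(dt)  # memory address of the data.table before the loop

for (i in seq_len(n)) {
  # set(x, i, j, value): assign value into row i, column j by reference
  set(dt, i = i, j = "x", value = runif(1))
  set(dt, i = i, j = "y", value = rnorm(1))
}

after <- address(dt)   # address after all the updates

# Compare the two addresses; equality means dt itself was never replaced by a copy
same_object <- identical(before, after)
```

Referring to columns by name here, rather than by position as in the benchmark code, is only a readability choice.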