Problem description
According to "Creating an R dataframe row-by-row", appending to a data.frame with rbind is not ideal, because it creates a copy of the whole data.frame each time. How do I accumulate data in R, ending up with a data.frame, without incurring this penalty? The intermediate format doesn't need to be a data.frame.
Recommended answer
First approach
I tried accessing each element of a pre-allocated data.frame:
res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}
But tracemem goes crazy (i.e. the data.frame is being copied to a new address each time).
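Since the question allows an intermediate format other than a data.frame, one way to sidestep the copying seen above is to fill one pre-allocated atomic vector per column and build the data.frame only once, after the loop. The following is a minimal base-R sketch of that idea, not something taken from the original answer; the names n, x and y and the call to set.seed() are illustrative assumptions.

```r
n <- 1000
set.seed(42)  # assumption: seeded only so the sketch is reproducible

# Pre-allocate one atomic vector per column instead of a data.frame.
x <- numeric(n)
y <- numeric(n)

# Fill the vectors element by element.
for (i in seq_len(n)) {
  x[i] <- runif(1)
  y[i] <- rnorm(1)
}

# Build the data.frame a single time, after all values are known.
res <- data.frame(x = x, y = y)
```

This only works when the number of rows is known in advance (or can be bounded), which is the same assumption the pre-allocated data.frame above makes.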
Alternative approach (doesn't work either)
One approach (I'm not sure it's faster, as I haven't benchmarked it yet) is to create a list of data.frames and then stack them all together:
makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
library(taRifx)
res.df <- stack(res)
Unfortunately, in creating the list I think you will be hard-pressed to pre-allocate. For instance:
> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:
In other words, replacing an element of the list causes the list to be copied. I assume the whole list is copied, but it's possible that it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
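For readers who want to try the list-based idea without the taRifx dependency, here is a hedged base-R sketch: rows are collected in a list created with vector("list", n) and combined with a single rbind call at the end. The variable names are hypothetical, and this is an illustration rather than the method benchmarked in this answer. As far as I know, in R 3.1.0 and later the copy triggered by replacing a list element is shallow (the element pointers are copied, not the elements themselves), but that is worth confirming with tracemem on your own R version.

```r
n <- 1000
set.seed(1)  # assumption: seeded only so the sketch is reproducible

# Pre-allocate a list with n empty slots; each slot will hold one row.
rows <- vector("list", n)

for (i in seq_len(n)) {
  # Store each row as a plain named list, which is cheaper than a data.frame.
  rows[[i]] <- list(x = runif(1), y = rnorm(1))
}

# Convert each row to a one-row data.frame and bind them in one rbind call.
res.df <- do.call(rbind, lapply(rows, as.data.frame))
```

The point of the design is that rbind runs once over all rows instead of once per row, so the growing result is never re-copied inside the loop.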
Probably the best approach
As with many speed- or memory-limited processes these days, the best approach may well be to use a data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:
library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i,x := runif(1)]
  dt[i,y := rnorm(1)]
}
# note no message from tracemem
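Another way to check the by-reference behaviour, besides watching for tracemem messages, is to compare the object's memory address before and after an update. The sketch below assumes the data.table package is installed and uses its address() helper; the variable names before and after are illustrative.

```r
library(data.table)

dt <- data.table(x = rep(0, 5), y = rep(0, 5))

# Record where the data.table lives in memory before the update.
before <- address(dt)

# Update a single cell by reference with :=.
dt[2L, x := 1.5]

# Record the address again; if no copy was made it should be unchanged.
after <- address(dt)
same_address <- identical(before, after)
```

If same_address is TRUE, the update happened in place, which is consistent with tracemem staying silent in the loop above.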
But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:
library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))
# Update each cell with := inside the loop
dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}
# Update each cell with set() inside the loop
dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}
library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)
(Results are shown below.)
Benchmarking
With the loop run 10,000 times, data.table is almost a full order of magnitude faster:
Unit: seconds
expr min lq median uq max
1 test.df() 523.49057 523.49057 524.52408 525.55759 525.55759
2 test.dt() 62.06398 62.06398 62.98622 63.90845 63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622
Comparison of := with set():
> m
Unit: milliseconds
expr min lq median uq max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2 dt.set(dt) 13.29612 13.29612 15.02891 16.7617 16.7617
Note that n here is 10^6, not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds rather than seconds. Impressive indeed.