Problem description
I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
For now I have worked out the following solution:
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
# we need to ensure unique row names
row.names(df) <- 1:nrow(df)
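For illustration, here is a minimal, self-contained sketch of this indexed-assignment approach; the two small data frames are made up purely so the example is runnable on its own:

# toy data frames standing in for the real df and df.extension
df <- data.frame(x = 1:3, y = c(10, 20, 30))
df.extension <- data.frame(x = 4:5, y = c(40, 50))

# assign the new rows into positions just past the current last row, which extends df
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
row.names(df) <- 1:nrow(df)   # restore unique, sequential row names

str(df)   # 5 obs. of 2 variables, original column types preserved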
Now I don't run out of memory. I think it's because I store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df, df.extension)) + object.size(df) + object.size(df.extension).
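If you want to sanity-check this accounting, object.size() reports per-object sizes in bytes; the lines below are only a rough illustration of the estimate above, not a real memory profile (and the last line does allocate the combined frame):

size.df  <- as.numeric(object.size(df))            # bytes used by df
size.ext <- as.numeric(object.size(df.extension))  # bytes used by df.extension
size.df + 2 * size.ext                              # rough peak estimate for the indexed assignment
size.df + size.ext + as.numeric(object.size(rbind(df, df.extension)))  # rough peak estimate for rbind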
After that I use
rm(df.extension)
gc(reset=TRUE)
to free the memory I don't need anymore.
This solves my problem for now, but I feel there is a more advanced way to do a memory-efficient rbind. I appreciate any comments on this solution.
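One option that is often suggested for this kind of problem (it is not part of the original post, so treat it as a hedged pointer rather than a drop-in fix) is data.table::rbindlist(), which builds the combined table in a single pass; whether it actually lowers peak memory for a given workload has to be measured:

library(data.table)

# combine the two frames with data.table's rbindlist instead of base rbind
combined <- rbindlist(list(df, df.extension))

# optionally convert back to a plain data.frame by reference
setDF(combined)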