Memory-efficient alternative to rbind: in-place rbind?

Problem description

I need to rbind two large data frames. Right now I use

df <- rbind(df, df.extension)

but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.

So my question is: Is there a way to avoid data duplication in memory when using rbind?

I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.

Solution

For now I have worked out the following solution:

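# index of the first new row
nextrow <- nrow(df) + 1
# assign the extension rows past the end of df, which grows df
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
# we need to ensure unique row names
row.names(df) <- seq_len(nrow(df))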
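
To make the steps above concrete, here is a minimal sketch on toy data. The two small data frames are made-up stand-ins for the real `df` and `df.extension`, and the call to `rbind` is only there to provide a reference result, which is affordable at this size.

```r
# Toy stand-ins for the real data (hypothetical values)
df <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))
df.extension <- data.frame(id = 4:5, value = c(0.4, 0.5))

# Reference result from plain rbind, used only for comparison
expected <- rbind(df, df.extension)

# Append by assigning to rows past the current end of df
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension

# Renumber the rows so the row names are unique and consecutive
row.names(df) <- seq_len(nrow(df))

# Compare the appended data frame with the reference, column by column
print(all.equal(df$id, expected$id))
print(all.equal(df$value, expected$value))
```

Note that the index range assumes `df.extension` has at least one row; with zero rows, `nextrow:(nextrow - 1)` counts downwards and would need a guard.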

Now I don't run out of memory. I think it's because I store

object.size(df) + 2 * object.size(df.extension)

while with rbind R would need

object.size(rbind(df, df.extension)) + object.size(df) + object.size(df.extension)
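
The two estimates can be compared directly on small data. This is only a sketch of the arithmetic in the two expressions above, not a measurement of R's actual peak memory use; the toy data frames and the `size` helper are assumptions made for illustration.

```r
# Toy stand-ins for the real data (hypothetical sizes and values)
df <- data.frame(id = 1:1000, value = seq(0, 1, length.out = 1000))
df.extension <- data.frame(id = 1001:1500, value = seq(1, 2, length.out = 500))

# Hypothetical helper: object size in bytes as a plain number
size <- function(x) as.numeric(object.size(x))

# Estimate for the row-assignment approach
in_place_estimate <- size(df) + 2 * size(df.extension)

# Estimate for df <- rbind(df, df.extension)
rbind_estimate <- size(rbind(df, df.extension)) + size(df) + size(df.extension)

print(c(in_place = in_place_estimate, rbind = rbind_estimate))
```

Subtracting one expression from the other leaves `object.size(rbind(df, df.extension)) - object.size(df.extension)`, which is roughly `object.size(df)`, so by these estimates the saving grows with the size of the original data frame.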

After that I use

rm(df.extension)
gc(reset=TRUE)

to free the memory I don't need anymore.

This solved my problem for now, but I feel that there is a more advanced way to do a memory-efficient rbind. I would appreciate any comments on this solution.
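
One possible direction, offered as a sketch rather than a tested answer to the question, is to combine the data one column at a time and drop each extension column as soon as it has been used, so that only one combined column needs to be newly allocated at any moment. It assumes both data frames have the same column names in the same order and compatible, non-factor column types; whether it lowers peak memory in practice depends on how R copies and garbage-collects, which this example does not measure.

```r
# Toy stand-ins for the real data (hypothetical values)
df <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))
df.extension <- data.frame(id = 4:5, value = c(0.4, 0.5))

# Assumption: identical column names in identical order
stopifnot(identical(names(df), names(df.extension)))

# Work on a plain list of columns and drop the data frame wrapper
cols <- as.list(df)
rm(df)

for (col in names(cols)) {
  # Combine one column, then release the extension's copy of it
  cols[[col]] <- c(cols[[col]], df.extension[[col]])
  df.extension[[col]] <- NULL
}

# Turn the list back into a data frame without going through data.frame()
attr(cols, "row.names") <- .set_row_names(length(cols[[1]]))
class(cols) <- "data.frame"
df <- cols
rm(cols, df.extension)

print(df)
```

Rebuilding the data frame by setting the `row.names` and `class` attributes is a shortcut that skips the checks `data.frame()` would perform, so the column lengths have to be consistent before this step.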


10-22 05:45