R: Selecting a subset without copying


Problem description

Is there a way to select a subset of an object (data frame, matrix, vector) without making a copy of the selected data?

I work with quite large data sets but never change them. For convenience, however, I often select subsets of the data to operate on. Copying a large subset each time wastes a lot of memory, but both normal indexing and subset() (and thus the xapply() family of functions) create copies of the selected data. So I'm looking for functions or data structures that can overcome this issue.

Some possible approaches that may fit my needs, and that are hopefully implemented in some R package:

  • a copy-on-write mechanism, i.e. data structures that are copied only when you add or rewrite elements;
  • immutable data structures that only require recreating the indexing information for the data structure, not its content (like making a substring from a string by creating only a small object that holds a length and a pointer to the same char array);
  • xapply() analogues that do not create subsets.

Recommended answer

Try the ref package, specifically its refdata class.

What you might be missing about data.table is that when grouping (the by= parameter), the subsets of data are not copied, so that's fast. [Well, technically they are copied, but into a shared area of memory that is reused for each group, and the copy is done with memcpy, which is much faster than R's for loops in C.]
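As a rough illustration of the grouping point, the sketch below computes a per-group summary with by= instead of extracting each subset by hand first. The table DT and its columns grp and value are made up for this example, and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical example data: two groups with three values each
DT <- data.table(grp = rep(c("a", "b"), each = 3), value = 1:6)

# Per-group summary; the per-group subsets are handled internally by data.table
group_means <- DT[, .(mean_value = mean(value)), by = grp]
print(group_means)
```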

:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copy-on-write. The user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
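A minimal sketch of what this means in practice, assuming the data.table package is installed. The names DT, alias and snapshot are hypothetical and chosen only for this example.

```r
library(data.table)

DT <- data.table(id = 1:3, value = c(10, 20, 30))

alias <- DT                    # plain assignment: both names refer to the same table
alias[, value := value * 2]    # := modifies in place, so DT changes as well

snapshot <- copy(DT)           # explicit copy of the table
snapshot[, value := 0]         # only the copy is modified

print(DT$value)
print(snapshot$value)
```

Because := works by reference, any function that should leave its input untouched needs to call copy() on it first.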

You're right that there isn't a mechanism like refdata built into data.table. I see what you mean, and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with a data.frame (but be sure to monitor copies with tracemem(DF)).
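One way to monitor copies with tracemem() could look like the sketch below. It uses a small made-up data frame, and it guards the call because tracemem() is only available when R was built with memory profiling support.

```r
DF <- data.frame(x = 1:5, y = letters[1:5])

# tracemem() needs an R build with memory profiling enabled
can_trace <- capabilities("profmem")
if (can_trace) tracemem(DF)

DF2 <- DF            # no copy yet: both names share the same object
DF2$x[1] <- 100L     # modifying DF2 forces a copy, which tracemem reports

if (can_trace) untracemem(DF)
```

Each line that tracemem prints corresponds to one duplication of the traced object, so a quiet run suggests the operation did not copy it.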

There is also idata.frame (an immutable data.frame) in the plyr package, which you could try.
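A short sketch of how idata.frame might be used, assuming the plyr package is installed. It relies on the baseball example data set that ships with plyr; the variable names are chosen only for this example.

```r
library(plyr)

# Wrap the example data set in an immutable data frame
ibaseball <- idata.frame(baseball)

# Split by player id and count rows; per the plyr documentation, subsets of an
# idata.frame refer back to the original data rather than copying it
rows_per_player <- dlply(ibaseball, "id", nrow)
length(rows_per_player)
```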

