Question
Is there a way to select a subset from objects (data frames, matrices, vectors) without making a copy of the selected data?
I work with quite large data sets, but never change them. However, for convenience I often select subsets of the data to operate on. Making a copy of a large subset each time is very memory-inefficient, but both normal indexing and subset (and thus the xapply() family of functions) create copies of the selected data. So I'm looking for functions or data structures that can overcome this issue.
Some possible approaches that may fit my needs, and that are hopefully implemented in some R packages:
- a copy-on-write mechanism, i.e. data structures that are copied only when you add or rewrite existing elements;
- immutable data structures that only require recreating the indexing information for the data structure, but not its content (like making a substring from a string by only creating a small object that holds a length and a pointer to the same char array);
- xapply() analogues that do not create subsets.
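To make the problem concrete, here is a minimal base R sketch of the behaviour described above. The variable names are purely illustrative. Plain assignment does not copy, since R already defers copying of whole objects until they are modified, but subsetting allocates a new vector that holds its own copy of the selected elements.

```r
x <- runif(1e6)        # about 8 MB of doubles
y <- x                 # no copy yet: both names refer to the same vector
s <- x[1:5e5]          # subsetting allocates a new vector with half the data

size_x <- as.numeric(object.size(x))
size_s <- as.numeric(object.size(s))
size_s / size_x        # close to 0.5: the subset carries its own copy
```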
Recommended answer
Try package ref. Specifically, its refdata class.
What you might be missing about data.table is that when grouping (the by= parameter), the subsets of data are not copied, so that's fast. [Well, technically they are copied, but into a shared area of memory that is reused for each group, and the copying uses memcpy, which is much faster than R's for loops in C.]
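As a rough illustration of grouping with by=, the sketch below assumes the data.table package is installed. The table and column names (DT, g, v, total) are made up for the example.

```r
library(data.table)

# Hypothetical toy table: two groups of three values each
DT <- data.table(g = rep(c("a", "b"), each = 3), v = 1:6)

# The expression in j is evaluated once per group defined by by=,
# without the user building a separate subset object for each group
res <- DT[, .(total = sum(v)), by = g]
res
```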
:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copy-on-write. The user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
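A small sketch of that difference, assuming data.table is installed. The names alias and snapshot are illustrative only.

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT              # another name for the same table, not a copy
snapshot <- copy(DT)     # explicit copy, independent of DT

DT[, y := x * 2L]        # adds column y in place, by reference

has_y_alias <- "y" %in% names(alias)        # alias sees the new column
has_y_snapshot <- "y" %in% names(snapshot)  # the explicit copy does not
```

Because := changes the table that both DT and alias point to, copy() is the way to keep an untouched version around.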
You're right that there isn't a mechanism like refdata built into data.table. I see what you mean, and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with data.frame (but be sure to monitor copies with tracemem(DF)).
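One way to monitor copies with tracemem() might look like the sketch below. It assumes an R build with memory profiling enabled (checked via capabilities("profmem")). DF and DF2 are illustrative names.

```r
DF <- data.frame(a = 1:3, b = 4:6)

# tracemem() is only available when R was built with memory profiling
tracing <- capabilities("profmem")
if (tracing) tracemem(DF)

DF2 <- DF          # no copy yet: DF2 and DF share the same object
DF2$a[1] <- 10L    # modifying DF2 forces a duplicate; tracemem reports it

if (tracing) untracemem(DF)
```

Note that tracemem() reports duplications of the traced object. A subset created by indexing is a new object rather than a duplicate, so it will not show up in this output.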
There is also idata.frame (an immutable data.frame) in package plyr that you could try.
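A minimal sketch of using idata.frame with a plyr split-apply function, assuming plyr is installed. The built-in iris data set stands in for a large table.

```r
library(plyr)

# Wrap the data frame in an immutable reference
idf <- idata.frame(iris)

# Count rows per group; each piece passed to nrow refers back to idf
counts <- dlply(idf, "Species", nrow)
counts
```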