I have two data.tables, X (3m rows by ~500 columns) and Y (100 rows by two columns).
library(data.table)
set.seed(1)
X <- data.table( a=letters, b=letters, c=letters, g=sample(c(1:5,7),length(letters),replace=TRUE), key="g" )
Y <- data.table( z=runif(6), g=1:6, key="g" )
I want to do a left outer join on X, which I can do with Y[X], thanks to:
Why does X[Y] join of data.tables not allow a full outer join, or a left join?
But I want to add the new column to X without copying X (since it's huge).
Obviously, something like X <- Y[X] works, but unless data.table is far cleverer than I give it credit for (and I give it credit for quite a lot of deviousness!), I believe this copies the whole of X.
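One way to check this suspicion is to compare memory addresses around the join. The sketch below is only illustrative: the small tables are invented stand-ins for the real X and Y, and it relies on data.table's address() helper. Under those assumptions it suggests that Y[X] builds a separate table whose columns are freshly allocated.

```r
library(data.table)

# Toy stand-ins for the real tables (values are made up for illustration)
X <- data.table(a = c("p", "q", "r", "s"), g = c(1L, 2L, 2L, 7L), key = "g")
Y <- data.table(g = 1:3, z = c(0.1, 0.2, 0.3), key = "g")

# Left join on X: every row of X is kept, z is NA where g has no match in Y
joined <- Y[X]

# The result is a separate object, and its columns are separate vectors
table_copied  <- address(joined) != address(X)
column_copied <- address(joined$a) != address(X$a)
```

Both flags should come out TRUE here, which is the cost the question is trying to avoid on a 3m-row table.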
X[ , z:= Y[X,z]$z ] works, but is kludgy and doesn't scale well to more than one column.
How do I store the results of a merge back into the retained data.table in an efficient (both in terms of copies and in terms of programmer time) way?
This is easy to do:
X[Y, z := i.z]
It works because the only difference between Y[X] and X[Y] here is when some key values in X have no match in Y, in which case presumably you'd want z to be NA, which is exactly what the above assignment does.
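As a minimal sketch of the assignment above, the following uses small made-up tables in place of the real X and Y. It assumes a freshly created data.table (which normally has spare column slots), so that := can add z in place. The address() check is just one way to see that X is still the same object afterwards.

```r
library(data.table)

# Toy stand-ins for the real tables (values are made up for illustration)
X <- data.table(a = c("p", "q", "r", "s"), g = c(1L, 2L, 2L, 7L), key = "g")
Y <- data.table(g = 1:3, z = c(0.1, 0.2, 0.3), key = "g")

addr_before <- address(X)

# Update join: for each row of Y, assign its z to the matching rows of X
X[Y, z := i.z]

# g = 7 has no match in Y, so its z stays NA; Y's g = 3 matches nothing in X
same_object <- identical(addr_before, address(X))
```

After this runs, X has gained a z column holding 0.1, 0.2, 0.2 and NA, and no new copy of X has been bound to the name.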
It would also work just as well for many variables:
X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]
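A small sketch of the multi-column form, again on invented data. The names z1 and z2 are hypothetical placeholders taken from the line above, not columns of the real Y.

```r
library(data.table)

X <- data.table(a = c("p", "q", "r"), g = c(1L, 2L, 7L), key = "g")
# z1 and z2 are hypothetical column names standing in for the real ones
Y <- data.table(g = 1:2, z1 = c(10, 20), z2 = c("u", "v"), key = "g")

# Add both columns to X in one update join; unmatched rows of X get NA
X[Y, `:=`(z1 = i.z1, z2 = i.z2)]
```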
Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so as not to get NAs for the key values from Y that X doesn't contain. That is:
X[Y, z := i.z, nomatch=0]
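To see what nomatch=0 changes, it may help to look at the plain join without :=. The sketch below uses invented data and only demonstrates the effect on X[Y] itself. How nomatch interacts with := may differ between data.table versions, so treat that combination as something to check against your installed release.

```r
library(data.table)

X <- data.table(a = c("p", "q", "r", "s"), g = c(1L, 2L, 2L, 7L), key = "g")
Y <- data.table(g = 1:3, z = c(0.1, 0.2, 0.3), key = "g")

# Default: Y's g = 3 has no match in X, so it appears as a row with a = NA
with_na <- X[Y]

# nomatch = 0 drops the rows of Y that match nothing in X
matched_only <- X[Y, nomatch = 0]
```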
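From the NEWS for data.table:
CHANGES IN DATA.TABLE VERSION 1.7.10
NEW FEATURES
o The prefix i. can now be used in j to refer to join inherited columns of i that are otherwise masked by columns in x with the same name.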