data.table join then add columns to existing data.frame without re-copy

Question

I have two data.tables, X (3 million rows by ~500 columns), and Y (100 rows by two columns).

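library(data.table)
set.seed(1)
# X stands in for the large table; g is drawn from c(1:5, 7), so 7 (absent from Y) can occur
X <- data.table( a=letters, b=letters, c=letters, g=sample(c(1:5,7),length(letters),replace=TRUE), key="g" )
# Y is the small lookup table, also keyed on g
Y <- data.table( z=runif(6), g=1:6, key="g" )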

I want to do a left outer join on X, which I can do by Y[X] thanks to:

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

But I want to add the new column to X without copying X (since it's huge).

Obviously, something like X <- Y[X] works, but unless data.table is far cleverer than I give it credit for (and I give it credit for quite a lot of deviousness!), I believe this copies the whole of X.

X[ , z:= Y[X,z]$z ] works, but is kludgy and doesn't scale well to more than one column.

How do I store the results of a merge back into the retained data.table in an efficient (both in terms of copies and in terms of programmer time) way?

Solution

This is easy to do:

X[Y, z := i.z]

It works because the only difference between Y[X] and X[Y] here is when some key values of X are not in Y, in which case presumably you'd want z to be NA, which is exactly what the above assignment does.
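
As a minimal sketch, the update join can be tried on the toy data from the question. Two things here are additions for illustration and not part of the original answer: the explicit on = "g" argument (the tables are already keyed on g, so it is redundant) and the use of address() to check that X was not reallocated. A reasonably recent data.table is assumed.

```r
library(data.table)

set.seed(1)
X <- data.table(a = letters, b = letters, c = letters,
                g = sample(c(1:5, 7), length(letters), replace = TRUE),
                key = "g")
Y <- data.table(z = runif(6), g = 1:6, key = "g")

addr_before <- address(X)   # memory address of X before the join
n_before <- nrow(X)

# Update join: for every row of X whose g matches a row of Y,
# write Y's z (referred to as i.z) into a new column z of X, by reference.
X[Y, on = "g", z := i.z]

addr_after <- address(X)

# Rows of X whose g has no match in Y (only g == 7 can be unmatched here) stay NA.
unmatched_all_na <- all(is.na(X[g == 7, z]))
matched_none_na <- !anyNA(X[g != 7, z])
```

If the join behaves as described, addr_before and addr_after should be identical and nrow(X) should still equal n_before, which is what "without copying X" means in practice.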

It would also work just as well for many variables:

X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]
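
A small, self-contained sketch of the multi-column form follows. The tables and the column names id, z1 and z2 are hypothetical and chosen only for illustration; on = "g" is again added for explicitness.

```r
library(data.table)

# Hypothetical toy tables: X is the table to extend, Y the lookup table.
X <- data.table(id = c("a", "b", "c", "d"), g = c(1L, 2L, 2L, 7L), key = "g")
Y <- data.table(g = 1:3, z1 = c(0.1, 0.2, 0.3), z2 = c("p", "q", "r"), key = "g")

# Add both columns from Y to X in a single update join, by reference.
X[Y, on = "g", `:=`(z1 = i.z1, z2 = i.z2)]
```

The row with g = 7 has no match in Y and gets NA in both new columns, while the Y row with g = 3 has no match in X and adds nothing, so X keeps its four rows.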


Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so as not to get NAs for those rows where X doesn't contain the key values from Y. That is:

X[Y, z := i.z, nomatch=0]


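From the NEWS for data.table:

**** CHANGES IN DATA.TABLE VERSION 1.7.10 ****

NEW FEATURES

o The prefix i. can now be used in j to refer to join inherited columns of i that are otherwise masked by columns in x with the same name.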
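
To illustrate what the i. prefix does, here is a small sketch in which both tables carry a column named z. The tables and the output names x_z and y_z are hypothetical; the point is only that a bare z in j resolves to the column of X, while i.z reaches the masked column of Y.

```r
library(data.table)

# Both tables have a column called z, so Y's z is masked inside X[Y, ...].
X <- data.table(g = 1:3, z = c(10, 20, 30), key = "g")
Y <- data.table(g = 2:3, z = c(0.2, 0.3), key = "g")

# Plain z refers to X's column; i.z refers to the column of the i table (Y).
res <- X[Y, on = "g", .(g, x_z = z, y_z = i.z)]
```

Here res has one row per row of Y, with x_z taken from X and y_z taken from Y.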
