



I have a data.table like this:

col1   col2   col3  new
1       4     55    col1
2       3     44    col2
3       34    35    col2
4       44    87    col3
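
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the name DT and the choice of plain numeric columns are assumptions, since the question does not state the actual column types.

```r
library(data.table)

# Assumed reconstruction of the example table; real column types may differ.
DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")  # name of the column to read from
)

print(DT)
```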

I want to populate another column matched_value that contains the values from the respective column names given in the new column:

col1   col2   col3  new    matched_value
1       4     55    col1        1
2       3     44    col2        3
3       34    35    col2        34
4       44    87    col3        87

E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.

How can I do this efficiently in R on a very large data.table?



DT[, newval := .SD[[.BY[[1]]]], by=new]

   col1 col2 col3  new newval
1:    1    4   55 col1      1
2:    2    3   44 col2      3
3:    3   34   35 col2     34
4:    4   44   87 col3     87

How it works. Grouping with by=new splits the rows into groups that share the same string in new. Within each group, that string is available as .BY[[1]]; call it newname. We then use it to pull the column of that name out of .SD via .SD[[newname]]. .SD stands for Subset of Data, i.e. the rows of the current group.
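
To make the grouping step more concrete, here is a small sketch that does by hand what happens for a single group and then runs the grouped assignment. The variables newname and group_rows are illustrative names only, and the table is rebuilt with numeric columns as an assumption about the original data.

```r
library(data.table)

# Assumed reconstruction of the example table (numeric columns).
DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")
)

# By hand, for the group where new == "col2":
newname    <- "col2"              # plays the role of .BY[[1]]
group_rows <- DT[new == newname]  # roughly what .SD holds for this group
col2_vals  <- group_rows[[newname]]  # values taken from col2 for rows 2 and 3

# The grouped assignment does the same for every distinct value of new.
DT[, newval := .SD[[.BY[[1]]]], by = new]
print(DT)
```

Because the lookup runs once per distinct value of new rather than once per row, the work stays vectorised within each group.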

Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
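
A quick way to check that the two forms agree is to run both on copies of the same table. This is a sketch under the same assumptions as before (numeric columns); DT_sd and DT_get are illustrative names, and it checks only equality of results, not speed.

```r
library(data.table)

# Assumed reconstruction of the example table (numeric columns).
DT <- data.table(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3")
)

# copy() is needed because := modifies a data.table by reference.
DT_sd  <- copy(DT)
DT_get <- copy(DT)

DT_sd[,  newval := .SD[[.BY[[1]]]], by = new]  # column lookup through .SD
DT_get[, newval := get(.BY[[1]]),   by = new]  # column lookup through get()

same_result <- identical(DT_sd$newval, DT_get$newval)
print(same_result)
```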


