Merging data.tables and summing shared columns

Question

I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to merge using the [ operator on matching rows of the LHS data.table, as shown below with tables a2 (LHS) and a (RHS):

library(data.table)
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = as.integer(rep(100, 10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2, "b")
setkey(a, "b")

a2
    b   c
 1: a 100
 2: b 100
 3: c 100
 4: d 100
 5: e 100
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

a
    b  c
 1: a  1
 2: b  2
 3: c  3
 4: d  4
 5: e  5
 6: f  6
 7: g  7
 8: h  8
 9: i  9
10: j 10

From the second answer to "Merge data frames whilst summing common columns in R", I saw how columns could be summed over matching rows, as follows:

setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
    b   c
 1: a 101
 2: b 102
 3: c 103
 4: d 104
 5: e 105
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

However, I am also trying to retain the rows that don't match.

Alternatively, I could use merge as shown below, but I would like to avoid creating a new table with 4 columns before reducing it to 2 columns.

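# Full outer join keeps unmatched rows from both tables (columns b, c.x, c.y)
c <- merge(a, a2, by = "b", all = TRUE)
# Row-wise sum of the two value columns, ignoring NAs from unmatched rows
c <- transform(c, value = rowSums(c[, 2:3], na.rm = TRUE))
c <- c[, c(1, 4)]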

c
    b value
 1: a   102
 2: b   104
 3: c   106
 4: d   108
 5: e   110
 6: f     6
 7: g     7
 8: h     8
 9: i     9
10: j    10
11: k   100
12: l   100
13: m   100
14: n   100
15: o   100

This last table is what I would like to achieve. Thanks in advance.

Recommended answer

merge is unlikely to be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:

rbindlist(list(a, a2))[, sum(c), b]

I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you first have to put your data.tables in a list).
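
Applied to the small tables from the question, the approach can be sketched as follows. This is a minimal illustration rather than part of the original answer: the tables are recreated from scratch, and the column name value and the use of keyby (to get the result sorted by b) are my own choices.

```r
library(data.table)

# Fresh copies of the question's tables (a2 has not been modified by reference)
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# Stack both tables, then sum c within each value of b.
# keyby sorts the result by b; "value" is an illustrative column name.
res <- rbindlist(list(a, a2))[, .(value = sum(c)), keyby = b]
res
```

Starting from fresh tables, the rows a to e sum to 101 to 105. The 102 to 110 shown in the question's merge output appear to come from a2 having already been updated in place by the := join before merge was run.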

Compare some timings on larger datasets:

library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))

system.time(rbindlist(list(a, a2))[, sum(c), b])
#   user  system elapsed
#   0.83    0.05    0.87

system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
#   user  system elapsed
# 159.58    0.48  162.95

## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166

nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
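
If updating a2 in place is preferred, as in the question's := attempt, one possible sketch is to run the update join and then append the rows of a that have no match in a2 with a not-join. This is an assumption on my part and not part of the original answer. It also assumes that b is unique within each table; with duplicated values of b the update join does not accumulate sums, so the rbindlist approach is the safer choice there.

```r
library(data.table)

# Fresh copies of the question's small tables
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# Update join: add a's c to a2's c for rows whose b matches (modifies a2 by reference)
a2[a, on = "b", c := c + i.c]

# Not-join: rows of a with no match in a2, appended to the updated a2
res <- rbind(a2, a[!a2, on = "b"])

# Sort by b so the result lines up with the table requested in the question
setkey(res, b)
res
```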
