问题描述
我有一些大型数据集,正在尝试 data.table
进行合并,同时对匹配行上的共享列求和。我知道如何使用LHS data.table中的 [
匹配行进行合并,如下所示与表 a2
:LHS和 a
:RHS
I have some large data sets and am trying out data.table
to combine them while summing up the shared column over matching rows. I know how to merge using [
matching rows in the LHS data.table as shown below with tables a2
:LHS and a
:RHS
a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
从第二个答案开始我看到了如何将匹配行上的列求和,如下所示:
from second answer hereMerge data frames whilst summing common columns in R I saw how columns could be summed up over matching rows, as such:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
但是我试图保留不匹配的行。
However I am trying retain the rows that don't match as well.
或者,我可以使用 merge
,如下所示,但是我想要一个空的空间来制作一个有4行的新表减少到2行。
Alternately I could use merge
as shown below but I would like a void making a new table with 4 rows before reducing it to 2 rows.
c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]
c
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
最后一张表是我想要实现的,谢谢。
This last table is what I would like to achieve, Thanks in Advance.
推荐答案
合并可能不是很有效。由于您的两个 data.table
具有相同的结构,我建议 rbind
将它们放在一起并求和通过他们的钥匙。换句话说:
merge
is likely to not be very efficient for the end result you are after. Since both of your data.table
s have the same structure, I would suggest rbind
ing them together and taking the sum by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
我使用了 rbindlist
,因为它通常在 rbind
data.table
s(即使您必须先放置 data.table
在列表中
)。
I've used rbindlist
because it is generally more efficient at rbind
ing data.table
s (even though you have to first put your data.table
s in a list
).
比较大型数据集的一些时间安排:
Compare some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
这篇关于合并data.tables并对共享列求和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!