本文介绍了合并data.tables并对共享列求和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些大型数据集,正在尝试 data.table 进行合并,同时对匹配行上的共享列求和。我知道如何使用LHS data.table中的 [匹配行进行合并,如下所示与表 a2 :LHS和 a :RHS

I have some large data sets and am trying out data.table to combine them while summing up the shared column over matching rows. I know how to merge using [ matching rows in the LHS data.table as shown below with tables a2:LHS and a:RHS

a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")

a2
    b   c
 1: a 100
 2: b 100
 3: c 100
 4: d 100
 5: e 100
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

a
    b  c
 1: a  1
 2: b  2
 3: c  3
 4: d  4
 5: e  5
 6: f  6
 7: g  7
 8: h  8
 9: i  9
10: j 10

从第二个答案开始我看到了如何将匹配行上的列求和,如下所示:

from second answer hereMerge data frames whilst summing common columns in R I saw how columns could be summed up over matching rows, as such:

setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
    b   c
 1: a 101
 2: b 102
 3: c 103
 4: d 104
 5: e 105
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

但是我试图保留不匹配的行。

However I am trying retain the rows that don't match as well.

或者,我可以使用 merge ,如下所示,但是我想要一个空的空间来制作一个有4行的新表减少到2行。

Alternately I could use merge as shown below but I would like a void making a new table with 4 rows before reducing it to 2 rows.

c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]

c
    b value
 1: a   102
 2: b   104
 3: c   106
 4: d   108
 5: e   110
 6: f     6
 7: g     7
 8: h     8
 9: i     9
10: j    10
11: k   100
12: l   100
13: m   100
14: n   100
15: o   100

最后一张表是我想要实现的,谢谢。

This last table is what I would like to achieve, Thanks in Advance.

推荐答案

合并可能不是很有效。由于您的两个 data.table 具有相同的结构,我建议 rbind 将它们放在一起并求和通过他们的钥匙。换句话说:

merge is likely to not be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:

rbindlist(list(a, a2))[, sum(c), b]

我使用了 rbindlist ,因为它通常在 rbind data.table s(即使您必须先放置 data.table 列表中)。

I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).

比较大型数据集的一些时间安排:

Compare some timings on larger datasets:

library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))

system.time(rbindlist(list(a, a2))[, sum(c), b])
#   user  system elapsed
#   0.83    0.05    0.87

system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
#   user  system elapsed
# 159.58    0.48  162.95

## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166

nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166

这篇关于合并data.tables并对共享列求和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-30 03:09