Problem description
I'm about to merge large data sets, so I tried out data.table and am thrilled by its speed.
# base R
system.time(
M1 <- Reduce(function(...) merge(..., all=TRUE), L)
)
# user system elapsed
# 5.05 0.00 5.20
# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
)
# user system elapsed
# 0.12 0.00 0.12
Both approaches yield the same values; however, some columns come out split with data.table.
base R:
set.seed(1)
car::some(M1, 5)
# sid id V3 V4 a b
# 60504 1 60504 -0.6964804 -1.210195 NA NA
# 79653 1 79653 -2.5287163 -1.087546 NA NA
# 111637 2 11637 0.7104236 NA -1.7377657 NA
# 171855 2 71855 0.2023342 NA -0.6334279 NA
# 272460 3 72460 -0.5098994 NA NA 0.2738896
data.table:
set.seed(1)
car::some(M2, 5)
# sid id V3.x V4 V3.y a V3 b
# 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
# 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
# 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
# 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
# 5: 3 72460 NA NA NA NA -0.5098994 0.2738896
Did I miss something? Or is there an easy way to solve this, i.e. to get the split columns combined? (I don't want to use any other packages.)
Data
fun <- function(x){
set.seed(x)
data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)
Recommended answer
The by argument in base::merge defaults to intersect(names(x), names(y)), where x and y are the two tables to be merged. Hence, base::merge uses V3 as a merging key in addition to sid and id.
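To illustrate the default described above, here is a minimal sketch in base R. The two small data frames x and y are hypothetical stand-ins for df1 and df2, not the question's data.

```r
# Toy stand-ins for df1 and df2 (hypothetical values)
x <- data.frame(sid = 1, id = 1:2, V3 = c(0.1, 0.2), V4 = c(1, 2))
y <- data.frame(sid = 2, id = 1:2, V3 = c(0.3, 0.4), a = c(3, 4))

# The columns merge() joins on when `by` is not given
by_default <- intersect(names(x), names(y))
by_default
# "sid" "id" "V3"

# V3 is a join column, so it stays a single column in the result
m <- merge(x, y, all = TRUE)
names(m)
# "sid" "id" "V3" "V4" "a"
```

Because V3 is part of the join, the rows from x and y never produce two competing V3 columns.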
The by argument in data.table's merge method defaults to the shared key columns of the two tables (i.e. sid and id in this case). Since both tables also have a non-key column named V3, suffixes are appended to keep the two copies apart in the result.
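The same toy tables show the keyed behaviour. This is only a sketch, assuming both tables are keyed on sid and id as in the question; x and y are again hypothetical.

```r
library(data.table)

# Toy stand-ins for two of the keyed tables (hypothetical values)
x <- data.table(sid = 1, id = 1:2, V3 = c(0.1, 0.2), V4 = c(1, 2))
y <- data.table(sid = 2, id = 1:2, V3 = c(0.3, 0.4), a = c(3, 4))
setkeyv(x, c("sid", "id"))
setkeyv(y, c("sid", "id"))

# Shared key columns: these become `by` when none is supplied
shared_keys <- intersect(key(x), key(y))
shared_keys

# V3 is not a key, so each table's V3 is kept as a separate, suffixed column
m <- merge(x, y, all = TRUE)
names(m)
```

The result should contain V3.x and V3.y, matching the split columns seen in the question's output.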
So if your intention is to merge by all common columns, you can identify the common columns, set them as keys, and then merge:
commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
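As a possible variation on the approach above, the common columns can also be passed through the by argument instead of being set as keys. The sketch below assumes a small hypothetical list L_toy in place of L; it is not the answer's original code.

```r
library(data.table)

# Hypothetical toy list with the same column layout as L
L_toy <- list(
  data.frame(sid = 1, id = 1:2, V3 = c(0.1, 0.2), V4 = c(1, 2)),
  data.frame(sid = 2, id = 1:2, V3 = c(0.3, 0.4), a = c(3, 4)),
  data.frame(sid = 3, id = 1:2, V3 = c(0.5, 0.6), b = c(5, 6))
)

# Columns present in every table
commcols <- Reduce(intersect, lapply(L_toy, names))

# Convert without setting keys and name the join columns explicitly
L_toy.dt <- lapply(L_toy, as.data.table)
M3 <- Reduce(function(x, y) merge(x, y, by = commcols, all = TRUE), L_toy.dt)
names(M3)
# "sid" "id" "V3" "V4" "a" "b"
```

Naming the join columns in by makes the intent visible at the call site and does not depend on which keys happen to be set.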