How to merge a list of data.tables without splitting columns?

Problem description

I'm about to merge large data sets, which is why I'm trying out data.table, and I'm thrilled by its speed.

# base R
system.time(
  M1 <- Reduce(function(...) merge(..., all=TRUE), L)
  )
# user  system elapsed
# 5.05    0.00    5.20

# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
  M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
  )
# user  system elapsed
# 0.12    0.00    0.12

Both approaches yield the same values, but with data.table some columns are split: V3 shows up as V3.x, V3.y and V3.

Base R:

set.seed(1)
car::some(M1, 5)
#        sid    id         V3        V4          a         b
# 60504    1 60504 -0.6964804 -1.210195         NA        NA
# 79653    1 79653 -2.5287163 -1.087546         NA        NA
# 111637   2 11637  0.7104236        NA -1.7377657        NA
# 171855   2 71855  0.2023342        NA -0.6334279        NA
# 272460   3 72460 -0.5098994        NA         NA 0.2738896

data.table:

set.seed(1)
car::some(M2, 5)
#    sid    id       V3.x        V4      V3.y          a         V3         b
# 1:   1 60504 -0.6964804 -1.210195        NA         NA         NA        NA
# 2:   1 79653 -2.5287163 -1.087546        NA         NA         NA        NA
# 3:   2 11637         NA        NA 0.7104236 -1.7377657         NA        NA
# 4:   2 71855         NA        NA 0.2023342 -0.6334279         NA        NA
# 5:   3 72460         NA        NA        NA         NA -0.5098994 0.2738896

Did I miss something? Or is there an easy way to solve this, i.e. to get the split columns combined? (I don't want to use any other packages.)

Data

fun <- function(x){
  set.seed(x)
  data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)


Recommended answer

The by argument in base::merge defaults to intersect(names(x), names(y)), where x and y are the two tables to be merged. Hence, base::merge also uses V3 as a merging key, so V3 is matched across tables instead of being duplicated.
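To make the base R default concrete, here is a minimal sketch on two toy data frames. The values and the names x, y, common and m are made up for illustration and are not from the question's data.

```r
# Two tiny data frames that share the columns sid, id and V3 (toy values)
x <- data.frame(sid = 1, id = 1:2, V3 = c(0.1, 0.2), V4 = c(1, 2))
y <- data.frame(sid = 2, id = 1:2, V3 = c(0.3, 0.4), a = c(3, 4))

# This is what base::merge uses for `by` when none is given
common <- intersect(names(x), names(y))
print(common)

# Full outer join on all common columns: V3 stays a single column
m <- merge(x, y, all = TRUE)
print(names(m))
```

Because V3 is part of the join key here, no .x/.y suffixes are needed.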

In data.table's merge method (merge.data.table), the by argument defaults to the shared key columns of the two tables (i.e. sid and id in this case). V3 is therefore not part of the join, and since both tables have a column named V3, suffixes are appended to keep the two copies apart in the result.
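The same effect can be reproduced on a small scale. This is only a sketch with made-up toy values, assuming the key-based default described above; the names x, y, shared_keys and m are illustrative.

```r
library(data.table)

# Toy tables (hypothetical values), keyed on sid and id only, as in the question
x <- data.table(sid = 1, id = 1:2, V3 = c(0.1, 0.2), V4 = c(1, 2))
y <- data.table(sid = 2, id = 1:2, V3 = c(0.3, 0.4), a = c(3, 4))
setkeyv(x, c("sid", "id"))
setkeyv(y, c("sid", "id"))

# No `by` is given, so the shared key columns are used for the join
shared_keys <- intersect(key(x), key(y))
m <- merge(x, y, all = TRUE)

print(shared_keys)
print(names(m))  # V3 is not a key column, so it should appear with suffixes
```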

So if your intention is to merge by all common columns, you can identify the common columns, set them as keys, and then merge:

commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
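As a variation on the code above (not part of the original answer), the common columns can also be passed to merge explicitly through by, which avoids relying on keys at all. A minimal sketch on toy stand-ins for L; the values and the name M3 are hypothetical.

```r
library(data.table)

# Toy stand-ins for the question's list L (hypothetical values)
L <- list(
  data.frame(sid = 1, id = 1:2, V3 = c(0.1, 0.2), V4 = c(1, 2)),
  data.frame(sid = 2, id = 1:2, V3 = c(0.3, 0.4), a  = c(3, 4)),
  data.frame(sid = 3, id = 1:2, V3 = c(0.5, 0.6), b  = c(5, 6))
)

# Columns present in every table: sid, id and V3
commcols <- Reduce(intersect, lapply(L, names))

# Convert to data.table and merge pairwise, naming the join columns explicitly
L.dt <- lapply(L, as.data.table)
M3 <- Reduce(function(x, y) merge(x, y, by = commcols, all = TRUE), L.dt)

print(names(M3))
```

Either way, V3 ends up as a single column because it is part of the join.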
