是否有一个合并函数可以优先考虑公共(public)变量中的非缺失值?

考虑以下示例。

首先,我们生成两个具有相同 ID 的 data.frames,但特定变量上的缺失值互补:

set.seed(1)
missings  <- sample.int(6, 3)
df1  <- data.frame(ID = letters[1:6], V1 = NA)
df2  <- data.frame(ID = letters[1:6], V1 = NA)
df1$V1[missings]  <- rnorm(3)
df2$V1[setdiff(1:6, missings)]  <- rnorm(3)

应用 mergejoin 包中的任何 dplyr 函数会产生类似于以下的结果:
> merge(df1, df2, by = 'ID')
  ID      V1.x       V1.y
1  a        NA -1.5399500
2  b 1.3297993         NA
3  c 0.4146414         NA
4  d        NA -0.9285670
5  e        NA -0.2947204
6  f 1.2724293         NA

我们想以一种“更智能”的方式连接这两个 data.frames,当另一个 data.frame 中没有丢失时,它会忽略一个 data.frame 中的缺失值,以获得以下输出:
> output <- df1
> output$V1[is.na(df1$V1)]  <- df2$V1[!(is.na(df2$V1))]
> output
  ID         V1
1  a -1.5399500
2  b  1.3297993
3  c  0.4146414
4  d -0.9285670
5  e -0.2947204
6  f  1.2724293

我们可以假设 df1df2 具有完全互补的 V1 缺失值。

编辑

适用于任意数量变量的解决方案将是理想的。

最佳答案

感谢@Gregor 和@StevenBeaupré 非常有帮助的评论,我想出了一个使用 coalesce.na 包中的 kimisc 的解决方案,该解决方案扩展到任意数量的变量:

mapply(function(x,y) coalesce.na(x,y), df1$V1, df2$V1)
[1] -1.5399500  1.3297993  0.4146414 -0.9285670 -0.2947204  1.2724293

请注意, df1$V1df2$V1 可以替换变量列表,允许以下内容:
> set.seed(1)
> missings  <- sample.int(6, 3)
> df1  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df2  <- data.frame(ID = letters[1:6],
+                    V1 = NA,
+                    V2 = NA)
> df1$V1[missings]  <- rnorm(3)
> df2$V1[setdiff(1:6, missings)]  <- rnorm(3)
> df1$V2[setdiff(1:6, missings)]  <- rnorm(3)
> df2$V2[missings]  <- rnorm(3)

> cbind(df1, df2)
  ID        V1           V2 ID         V1         V2
1  a        NA -0.005767173  a -1.5399500         NA
2  b 1.3297993           NA  b         NA -0.7990092
3  c 0.4146414           NA  c         NA -0.2894616
4  d        NA  2.404653389  d -0.9285670         NA
5  e        NA  0.763593461  e -0.2947204         NA
6  f 1.2724293           NA  f         NA -1.1476570

> dfMerged <- merge(df1, df2, by = 'ID')
> xList <- dfMerged[grep("\\.x$", names(dfMerged))]
> yList <- dfMerged[grep("\\.y$", names(dfMerged))]

> mapply(function(x,y) coalesce.na(x,y), xList, yList)
           V1.x         V2.x
[1,] -1.5399500 -0.005767173
[2,]  1.3297993 -0.799009249
[3,]  0.4146414 -0.289461574
[4,] -0.9285670  2.404653389
[5,] -0.2947204  0.763593461
[6,]  1.2724293 -1.147657009

因此,完整的解决方案如下所示:
library(kimisc)
smartMergeList <- function(dfList, idVar) {
    merged <- Reduce(x = dfList,
                     f = function(x,y) merge(x, y, by = idVar, all = T))
    xList <- merged[grep("\\.x$", names(merged))]
    yList <- merged[grep("\\.y$", names(merged))]
    merged[names(xList)] <- mapply(function(x,y) coalesce.na(x,y),
                            xList, yList)
    merged[names(yList)] <- NULL
    merged
})

不过,我很想看到更漂亮的东西!

关于r - 合并/加入优先考虑非缺失值,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/37714533/

10-13 05:40