Problem description
This is my first time posting here, so please be kind ;-)
EDIT: My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now; thanks to everyone who has answered so far!
How can I identify records/rows in data frame x.1 that are not contained in data frame x.2, based on all available attributes (i.e. all columns), in the most efficient way?
> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
  a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
  a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4
Desired result
  a b
2 2 2
5 5 5
BEST SOLUTION SO FAR
by Prof. Brian Ripley and Gabor Grothendieck
> library(microbenchmark)
> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> # microbenchmark reports times in nanoseconds; convert the median to seconds
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
[1] 0.000207784
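One caveat with the paste-based key: with the default separator (a single space), two different rows can collapse to the same string when character columns themselves contain spaces. The following is only a sketch of one way to reduce that risk, not part of the original solution. The function name rows_not_in and its sep argument are hypothetical, and the approach assumes the chosen separator never occurs in the data.

```r
# Sketch of a paste-based anti-join with a configurable separator.
# `rows_not_in` and `sep` are illustrative names, not from the original post.
rows_not_in <- function(x.1, x.2, sep = "\r") {
  # unname() keeps a column called "sep" or "collapse" from being
  # mistaken for an argument of paste()
  x.1p <- do.call("paste", c(unname(as.list(x.1)), sep = sep))
  x.2p <- do.call("paste", c(unname(as.list(x.2)), sep = sep))
  x.1[!x.1p %in% x.2p, , drop = FALSE]
}

# Two different rows that both paste to "1 2 3" with a space separator
y.1 <- data.frame(a = "1 2", b = "3", stringsAsFactors = FALSE)
y.2 <- data.frame(a = "1", b = "2 3", stringsAsFactors = FALSE)

nrow(rows_not_in(y.1, y.2, sep = " "))  # 0: the row is wrongly treated as present
nrow(rows_not_in(y.1, y.2))             # 1: the row is correctly reported as missing
```

For purely numeric data such as x.1 and x.2 above, the default separator does not cause this kind of collision, so the extra argument mainly matters for character or factor columns.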
A collection of all the solutions tested so far is available on my blog.
Here's the best solution wrapped into a function 'mergeX()':
setGeneric(
    name="mergeX",
    signature=c("src.1", "src.2"),
    def=function(
        src.1,
        src.2,
        ...
    ){
        standardGeneric("mergeX")
    }
)
setMethod(
    f="mergeX",
    signature=signature(src.1="data.frame", src.2="data.frame"),
    definition=function(
        src.1,
        src.2,
        do.inverse=FALSE,
        ...
    ){
        if(!do.inverse){
            out <- merge(x=src.1, y=src.2, ...)
        } else {
            if("by.y" %in% names(list(...))){
                src.2.0 <- src.2
                src.2 <- src.1
                src.1 <- src.2.0
            }
            src.1p <- do.call("paste", src.1)
            src.2p <- do.call("paste", src.2)
            out <- src.1[! src.1p %in% src.2p, ]
        }
        return(out)
    }
)
Recommended answer
Here are a few ways. #1 and #4 assume that the rows of x.1 are unique. (If the rows of x.1 are not unique, they will return only one copy of each duplicated row.) The others return all duplicates:
# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]
# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))
# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]
# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")
EDIT: x.1 and x.2 were swapped; this has been fixed. The note on limitations at the beginning has also been corrected.
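To make the uniqueness caveat concrete, here is a small sketch comparing #1 and #3 on data where x.1 contains a repeated row. The data frames below are made up for illustration and are not the ones from the question.

```r
# Illustrative data (not from the question): the row (2, 2) appears twice in x.1
x.1 <- data.frame(a = c(1, 2, 2, 5), b = c(1, 2, 2, 5))
x.2 <- data.frame(a = c(1, 3), b = c(1, 3))

# 1: duplicated() also flags the second (2, 2) row, so only one copy survives
res.1 <- x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))], ]

# 3: the paste-based key keeps every row of x.1 that has no match in x.2
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
res.3 <- x.1[!x.1p %in% x.2p, ]

nrow(res.1)  # 2
nrow(res.3)  # 3
```

Which behaviour is preferable depends on whether repeated rows in x.1 are meaningful records or should be collapsed.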