问题描述
我的数据带有两列(可能更多)标识符(通常是长字符串)。这些有时会有所不同,输入错误或随时间变化。我想识别数据中的唯一主题。这需要识别通过其ID在某种程度上相关联的案例组。
I have data with two (potentially more) columns of identifiers (typically long strings). These differ sometimes, are mistyped, or change over time. I want to identify unique subjects in the data. This requires identifying groups of cases which are connected via their ids at some level.
示例
df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
idb = c(1, 1, 3, 4, 4, 7),
trueid = c("id1", "id1", "id2", "id2", "id2", "id3"))
> df
ida idb trueid
1 A 1 id1
2 B 1 id1
3 C 3 id2
4 C 4 id2
5 D 4 id2
6 E 7 id3
id1
是 A, B,1
,表示 id2
C, D,3、4
和 id3
E,7
。
我不知道 trueid
,但需要使用 ida列中的信息进行查找
和 idb
。
I don't know trueid
but need to find it using the information from columns ida
and idb
.
该解决方案需要扩展到数以十万计的数百万个观测值数千个唯一ID。我已经在使用 data.table
。
The solution needs to scale to millions of observations with tens of thousands of unique ids. I am already using data.table
.
扩展:还有另一种情况有两列以上,有些列可能对其他列有帮助,即具有相同的标识符。我不知道哪些列对哪些内容有用。我认为类型可以忽略,但是所有列都是字符串或可以安全地转换。
Extension: There is another scenario where there are more than two columns, and some columns might be informative for others, i.e. have the same identifiers. I do not know which columns are informative for which. I think type can be disregarded though, all columns are strings or can safely be converted.
另一个示例:
df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
idb = c("1", "2", "3", "4", "4", "7"),
idc = c("1", "1", "2", "3", "4", "5"),
idd = c("1", "A", "2", "3", "4", "5"),
trueid = c("id1", "id1", "id1", "id1", "id1", "id2"))
> df
ida idb idc idd trueid
1 A 1 1 1 id1
2 B 2 1 A id1
3 C 3 2 2 id1
4 C 4 3 3 id1
5 D 4 4 4 id1
6 E 7 5 5 id2
编辑:正如评论者所指出的,这本质上是在图中找到完整子图的问题。阅读更多内容后,我知道可以使用 library(igraph)
解决此问题。我将问题悬而未决,因为我希望使用依赖 base
, data.table
或<$ c的解决方案$ c> dplyr 。我无法轻松地在正在使用的服务器上安装软件包,安装 igraph
涉及很多繁琐的手续和延误。
As a commenter pointed out, this is essentially the clique problem of finding complete subgraphs in a graph. After reading a bit more, I understand this issue can be solved with library(igraph)
. I am leaving the question open as I would prefer a solution that relies on base
, data.table
or dplyr
. I cannot easily install packages on the server I am using, installing igraph
involves dealing with a lot of red tape and delays.
Edit2:对于阅读此书并面临类似问题的任何人: zx8754
使用igraph的答案要快得多(几个数量级)在具有更多组的更大(模拟)数据上。如果您有机会使用 igraph
,请这样做。
For anybody reading this and facing a similar problem: zx8754
's answer using igraph is considerably (several orders of magnitude) faster on larger (simulated) data with more groups. If you have the chance to use igraph
, do so.
推荐答案
这是使用 data.table
的递归方法:
#convert into a long format for easier processing
mDT <- melt(DT[, rn := .I], id.var="rn", variable.name="V", value.name="ID")[,
tid := NA_integer_]
#the recursive function
link <- function(ids, label) {
#identify the rows in DT containing ids and extract the IDs
newids <- mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), allow.cartesian=TRUE,
unique(ID)]
#update those rows to the same group
mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), tid := label]
if (length(setdiff(newids, ids)) > 0L) {
#call the recursive function if there are new ids
link(newids, label)
}
}
#get the first id that is not labelled yet
id <- mDT[is.na(tid), ID[1L]]
grp <- 1L
while(!is.na(id)) {
#use recursive function to link them up
link(id, grp)
#repeat for next id that is not part of any group yet
id <- mDT[is.na(tid), ID[1L]]
grp <- grp + 1L
}
#update original DT with tid
DT[mDT, on=.(rn), tid := tid]
数据:
library(data.table)
DT <- data.table(ida = c("A", "B", "C", "C", "D", "E"),
idb = c("1", "2", "3", "4", "4", "7"),
idc = c("1", "1", "2", "3", "4", "5"),
idd = c("1", "A", "2", "3", "4", "5"))
这篇关于在几列中查找唯一的一组标识符/组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!