Problem description
Suppose I have two sets of identifiers, id1 and id2, in a data frame. How can I create a new identifier id3 that works as follows?
I consider id1 the stricter key, so observations are grouped first by id1 and then by id2. If two sets of rows have different values of id2 but some of their rows share the same id1, both sets should get the same value of id3 (the exact value of id3 doesn't matter much).
df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7),
                 id3 = c(1, 1, 2, 2, 2, 3))
Rows 1 and 2 are grouped together because they have the same id1. Rows 3, 4 and 5 are grouped together because rows 3 and 4 have the same id1 and rows 4 and 5 have the same id2.
Can someone help? I would prefer a dplyr solution that covers the general case, in which the id columns can hold an arbitrary number of possible values.
Recommended answer
This is a graph theory problem. Each distinct value of id1 and of id2 is a separate node, and each row of df is a link between one id1 node and one id2 node. You want to find which weakly connected cluster (connected component) each id belongs to.
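library(dplyr)
library(igraph)

# Prefix each identifier so that equal values in id1 and id2 become distinct nodes
df <- df %>% mutate(from = paste0('id1', '_', id1), to = paste0('id2', '_', id2))
# Each row is an undirected edge between its id1 node and its id2 node
dg <- graph_from_data_frame(df %>% select(from, to), directed = FALSE)
# Look up each row's connected component by the name of its id1 node
df <- df %>% mutate(id3 = components(dg)$membership[from])
df %>% select(id1, id2, id3)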
#>   id1 id2 id3
#> 1   1   4   1
#> 2   1   3   1
#> 3   2   1   2
#> 4   2   2   2
#> 5   5   2   2
#> 6   6   7   3
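Since the question asks for dplyr, here is a rough sketch of the same idea without igraph. It is not part of the answer above but a swapped-in technique: each row starts with its own label, and the smallest label is repeatedly spread within every id1 group and every id2 group until nothing changes. The function name label_components is made up for illustration, and the column names id1 and id2 are assumed to match the example data.

```r
library(dplyr)

df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7))

# Hypothetical helper: label connected groups using dplyr only
label_components <- function(data) {
  # Start with a unique label per row
  data <- data %>% mutate(id3 = row_number())
  repeat {
    # Spread the smallest label within each id1 group, then each id2 group
    updated <- data %>%
      group_by(id1) %>% mutate(id3 = min(id3)) %>%
      group_by(id2) %>% mutate(id3 = min(id3)) %>%
      ungroup()
    # Stop once a full pass leaves every label unchanged
    if (identical(updated$id3, data$id3)) break
    data <- updated
  }
  # Renumber the surviving labels as 1, 2, 3, ...
  data %>% mutate(id3 = dense_rank(id3))
}

result <- label_components(df)
```

For the example data this should reproduce the grouping requested in the question. On large data with long chains of links the loop may need many passes, so the igraph approach is likely the better choice there.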