Problem Description
I have a data frame that looks like this:
Name Start_Date End_Date
A 2015-01-01 2019-12-29
A 2017-03-25 NA
A 2019-10-17 NA
A 2012-04-16 2015-01-09
A 2002-06-01 2006-02-01
A 2005-12-24 NA
B 2018-01-23 NA
I want to create a column such that two observations are classified into the same group if they share the same Name and the Start_Date of one falls within ±1 year of the other observation's End_Date.
Desired output:
Name Start_Date End_Date Wanted
A 2015-01-01 2019-12-29 1
A 2017-03-25 NA NA
A 2019-10-17 NA 1
A 2012-04-16 2015-01-09 1
A 2002-06-01 2006-02-01 2
A 2005-12-24 NA 2
B 2018-01-23 NA NA
I am looking for a data.table solution, but anything that solves my problem is good enough.
Added: row-by-row explanation
Rows:
1. The Start_Date is 8 days before the End_Date of row 4 (<1 year), so it is in the same group as row 4.
2. The Start_Date is more than two years away from the End_Date of row 1, so it is not in the same group as row 1. The same holds against rows 4 and 5; it is not in the same group as either of them.
3. The Start_Date is about 2 months before the End_Date of row 1 (<1 year), so it is in the same group as row 1.
4. See row 1.
5. See below.
6. The Start_Date is a little over a month before the End_Date of row 5 (<1 year), so it is in the same group as row 5.
7. There is no other Name B to compare it with, so it is in its own group.
Rows 1, 3, and 4 are in the same group. Rows 5 and 6 are in the same group. Rows 2 and 7 have no group. Edit: I have updated the code so that an observation that does not match any other observation gets a consistent Wanted category.
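For reference, the gaps described above can be checked directly with base-R date arithmetic (dates copied from the table; this snippet is only a sanity check, not part of the solution):
# Gaps between the paired dates described above.
as.Date("2015-01-09") - as.Date("2015-01-01")  # row 4 End vs. row 1 Start:  8 days
as.Date("2019-12-29") - as.Date("2019-10-17")  # row 1 End vs. row 3 Start: 73 days
as.Date("2006-02-01") - as.Date("2005-12-24")  # row 5 End vs. row 6 Start: 39 days
as.Date("2017-03-25") - as.Date("2015-01-09")  # row 2 Start vs. row 4 End: 806 days (> 1 year)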
Recommended Answer
Approach
Here is a solution that prefers data.table: although dplyr and fuzzyjoin might appear more elegant, they could also prove less efficient on a sufficiently large dataset.
igraph is used to index the networks within a graph. Here, the networks are the separate chains, the Wanted groups, of data.frame rows, linked together by the proximity of their Start_Dates and End_Dates. This approach seems necessary to model the transitive relationship ℛ requested here, while also preserving the symmetry of ℛ (see Further Reading).
I have also included a Flag column, which flags every row whose Start_Date is matched by the End_Dates of at least flag_at other rows, or vice versa.
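To illustrate the grouping step in isolation (the edge list below is a hypothetical toy, not your data), igraph::components() assigns one group index per connected chain of linked rows:
library(igraph)
# Hypothetical links: rows 1-4 and 1-3 are "close", as are rows 5-6.
toy_edges <- data.frame(from = c(1, 1, 5), to = c(4, 3, 6))
toy_graph <- igraph::graph_from_data_frame(toy_edges, directed = FALSE)
# Rows 1, 3, and 4 land in one component; rows 5 and 6 in another.
igraph::components(toy_graph)$membership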
Solution
Using your sample data.frame, reproduced here as my_data_frame
# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")
we apply data.table and igraph (among other packages) as follows:
library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)
# ...
# Code to generate your data.frame 'my_data_frame'.
# ...
# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()
# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)
# Set the minimum number of matches for an row to be flagged: 2.
flag_at <- 2
#####################################
# BEGIN: Start Indexing the Groups. #
#####################################
# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
########################################################
# STEP 1: Link the Rows That Are "Close" to Each Other #
########################################################
# Prepare data.table for JOIN, by adding appropriate helper columns.
.[, `:=`(# Uniquely identify each row (by row number).
ID = .I,
# Boundary columns for tolerance threshold.
End_Low = End_Date - tolerance,
End_High = End_Date + tolerance)] %>%
# JOIN rows to each other, to obtain pairings.
.[my_data_table,
# Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
# close enough to (within the boundary columns for) the 'End_Date' of y.
.(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
# JOIN criteria:
on = .(# Only pair rows having the same name.
Name,
# Only pair rows whose start and end dates are within the tolerance
# threshold of each other.
End_Low <= Start_Date,
End_High >= Start_Date),
# Make it an OUTER JOIN, to include those rows without a match.
nomatch = NA] %>%
# Prepare pairings for network analysis.
.[# Ensure no row is reflexively paired with itself.
# NOTE: This keeps the graph clean by trimming extraneous loops, and it
# prevents an "orphan" row from contributing to its own tally of matches.
!(x.ID == y.ID) %in% TRUE,
# Simplify the dataset to only the pairings (by ID) of linked rows.
.(from = x.ID, to = y.ID)]
#############################
# PAUSE: Count the Matches. #
#############################
# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
# Include again the missing IDs for y that were never matched by the JOIN.
.[my_data_table[, .(ID)], on = .(to = ID)] %>%
# For each row y, count every other row x where x R y.
.[, .(Matches = sum(!is.na(from))), by = to] %>%
# Extract the count column.
.$Matches
# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
# For each row x, count every other row y where x R y.
.[, .(Matches = sum(!is.na(to))), by = from] %>%
# Extract the count column.
.$Matches
#########################################
# RESUME: Continue Indexing the Groups. #
#########################################
# Resume indexing:
output <- output %>%
# Ignore nonmatches (NAs) which are annoying to process into a graph.
.[from != to, ] %>%
###############################################################
# STEP 2: Index the Separate "Chains" Formed By Those "Links" #
###############################################################
# Convert pairings (by ID) of linked rows into an undirected graph.
igraph::graph_from_data_frame(directed = FALSE) %>%
# Find all groups (subgraphs) of transitively linked IDs.
igraph::components() %>%
# Pair each ID with its group index.
igraph::membership() %>%
# Tabulate those pairings...
utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
# ...in a properly named data.table.
data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
#####################################################
# STEP 3: Match the Original Rows to their "Chains" #
#####################################################
# LEFT JOIN (on ID) to match each original row to its group index (if any).
.[my_data_table, on = .(ID)] %>%
# Transform output into final form.
.[# Sort into original order.
order(ID),
.(# Select existing columns.
Name, Start_Date, End_Date,
# Rename column having the group indices.
Wanted = Group_Index,
# Calculate column(s) to flag rows with sufficient matches.
Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]
# View results.
output
Results
The resulting output is the following data.table:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
Keep in mind that the Flags are all FALSE only because your data lack any Start_Date that is matched by (at least) two End_Dates, as well as any End_Date that is matched by (at least) two Start_Dates.
Hypothetically, if we lowered flag_at to 1, the output would Flag every row having even a single match (in either direction):
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 TRUE
4: A 2012-04-16 2015-01-09 1 TRUE
5: A 2002-06-01 2006-02-01 2 TRUE
6: A 2005-12-24 <NA> 2 TRUE
7: B 2018-01-23 <NA> NA FALSE
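For intuition, the Flag rule is simply a threshold over the two match counts. A minimal sketch follows; the per-row counts are tallied by hand from the sample data, so treat them as illustrative:
# Hand-tallied counts for the 7 sample rows (illustrative only).
flag_at <- 1
Start_Matched <- c(1, 0, 1, 0, 0, 1, 0)  # how often each row's Start_Date matches another row's End_Date
End_Matched   <- c(1, 0, 0, 1, 1, 0, 0)  # how often each row's End_Date is matched by another row's Start_Date
(Start_Matched >= flag_at) | (End_Matched >= flag_at)
# TRUE FALSE TRUE TRUE TRUE TRUE FALSE -- matching the Flag column above.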
Caveat
Because certain data.table operations modify by reference (or "in place"), the values of my_data_table change over the course of the workflow. After Step 1, my_data_table has become
Name Start_Date End_Date ID End_Low End_High
1: A 2015-01-01 2019-12-29 1 2018-12-29 2020-12-29
2: A 2017-03-25 <NA> 2 <NA> <NA>
3: A 2019-10-17 <NA> 3 <NA> <NA>
4: A 2012-04-16 2015-01-09 4 2014-01-09 2016-01-09
5: A 2002-06-01 2006-02-01 5 2005-02-01 2007-02-01
6: A 2005-12-24 <NA> 6 <NA> <NA>
7: B 2018-01-23 <NA> 7 <NA> <NA>
which departs from the structure of my_data_frame, from which it was originally copied.
Because dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely. Still, you must take care when modifying this workflow, because the version of my_data_table available before Step 1 cannot be recovered afterward.
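If you do want to keep that pre-Step-1 version around, one option (not part of the workflow above) is to take a deep copy beforehand, for example:
# Deep-copy the data.table so by-reference edits leave the backup untouched.
my_backup <- data.table::copy(my_data_table)
# ... run the workflow on my_data_table ...
# Restore the pre-Step-1 version whenever needed.
my_data_table <- data.table::copy(my_backup)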
Further Reading
Although the JOIN is explicitly directional, this model manages to preserve the relational symmetry you describe here, by virtue of using an undirected graph.
When the JOIN relates the 1st row 𝑥 (via its Start_Date) to the 4th row 𝑦 (via its End_Date), we consider 𝑥 "close enough" to 𝑦 (within 1 year); mathematically, we say 𝑥 ℛ 𝑦. However, the converse 𝑦 ℛ 𝑥 will not necessarily appear in the JOINed data, because 𝑦's Start_Date might not fall so conveniently near 𝑥's End_Date; that is, the JOINed data need not record 𝑦 ℛ 𝑥 at all. In that latter case, a strictly directed graph (a "digraph") would NOT capture the co-membership of 𝑥 and 𝑦 in the same group. You can observe these consequences for yourself by setting directed = TRUE in the first line of Step 2
igraph::graph_from_data_frame(directed = TRUE) %>%
and mode = "strong" on the next line
igraph::components(mode = "strong") %>%
to generate these dissociated results:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 4 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 3 FALSE
4: A 2012-04-16 2015-01-09 5 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 1 FALSE
7: B 2018-01-23 <NA> NA FALSE
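The effect is easy to reproduce in isolation with a hypothetical one-edge graph, where only 𝑥 ℛ 𝑦 has been recorded:
library(igraph)
# A single directed edge x -> y: x R y was recorded, but y R x was not.
toy_pair <- data.frame(from = "x", to = "y")
toy_digraph <- igraph::graph_from_data_frame(toy_pair, directed = TRUE)
igraph::components(toy_digraph, mode = "strong")$membership  # x and y fall in separate groups
igraph::components(toy_digraph, mode = "weak")$membership    # x and y fall in the same group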
By contrast, the rows are grouped correctly either by using an undirected graph (directed = FALSE) or by using the laxer criterion (mode = "weak") on a directed graph. Either approach effectively simulates the presence of 𝑦 ℛ 𝑥 whenever 𝑥 ℛ 𝑦 is present in the JOINed data.
This symmetric property is especially important for modeling the behavior you describe here: you want any two rows 𝑦 and 𝑧 to be placed in the same group (𝑦 ℛ 𝑧) whenever their End_Dates are matched by the same Start_Date of some other row 𝑥, that is, whenever 𝑥 ℛ 𝑦 and 𝑥 ℛ 𝑧.
We can verify this outcome by using
my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))
to append an 8th row to my_data_frame, prior to the workflow:
Name Start_Date End_Date
1 A 2015-01-01 2019-12-29
# ⋮ ⋮ ⋮ ⋮
4 A 2012-04-16 2015-01-09
# ⋮ ⋮ ⋮ ⋮
8 A 2010-01-01 2015-01-05
This 8th row serves as our 𝑧, where 𝑥 is the 1st row and 𝑦 is the 4th row, as before. Sure enough, the output correctly classifies 𝑦 and 𝑧 as belonging to the same group 1: 𝑦 ℛ 𝑧.
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
8: A 2010-01-01 2015-01-05 1 FALSE
Likewise, the output now correctly Flags the 1st row, whose Start_Date is matched by two End_Dates: those of the 4th and 8th rows.