本文介绍了基于模糊标准创建组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框,如下所示:

Name   Start_Date   End_Date
A      2015-01-01   2019-12-29
A      2017-03-25   NA
A      2019-10-17   NA
A      2012-04-16   2015-01-09
A      2002-06-01   2006-02-01
A      2005-12-24   NA
B      2018-01-23   NA

我想创建一个列,如果两个观测具有相同的Name,并且其中一个的Start_Date在另一个观测的End_Date内为±1年,则它们被归类为同一组。

所需输出:

Name   Start_Date   End_Date    Wanted
A      2015-01-01   2019-12-29  1
A      2017-03-25   NA          NA
A      2019-10-17   NA          1
A      2012-04-16   2015-01-09  1
A      2002-06-01   2006-02-01  2
A      2005-12-24   NA          2
B      2018-01-23   NA          NA

我正在寻找使用数据表的解决方案,但解决我的问题就足够了。

新增:逐行说明
行:

  1. 开始日期比第4行的结束日期早8天(<;1年)。它与第4行在同一组中。
  2. 开始日期是第1行结束日期之后的两年多。与第1行不在同一组中。与第4行、第5行相同。它也与这两行不在同一组中。
  3. 开始日期比行%1的结束日期早2个月(<;1年)。它与行%1在同一组中。
  4. 参见第1行。
  5. 见下文。
  6. 开始日期比第5行的结束日期早3个月(<;1年)。它与第5行在同一组中。
  7. 没有其他名称B可比较。它在自己的组中。
因此,134行在同一组中。行56在同一组中。行27没有组。

编辑:我已更新代码,以便在一个观察与另一个观察不匹配时具有一致的Wanted类别。

推荐答案

接近

以下是data.table的首选解决方案:

虽然dplyrfuzzyjoin可能看起来更优雅,但对于足够大的数据集,它们的效率也可能较低。

功劳归功于ThomasIsCoding抢先于this other questionan answer利用igraph在图中索引网络。在这里,网络是由data.frame行组成的单独的Wanted链和组,它们通过它们的Start_DateEnd_Date之间的紧密程度连接在一起。这种方法似乎对于对transitive relationship请求的ℛhere

进行建模是必要的

还要注意保持ℛ的对称性(请参阅进一步阅读)。

根据same request

和您的further clarification

我还包含了Flag列,用于标记其Start_Date与至少flag_at其他行的End_Date匹配的每一行;反之亦然。


解决方案

使用您的样本data.frame,此处转载为my_data_frame

# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")

我们应用data.tableigraph(以及其他包)如下:

library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)



# ...
# Code to generate your data.frame 'my_data_frame'.
# ...



# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()


# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)

# Set the minimum number of matches for an row to be flagged: 2.
flag_at <- 2



#####################################
# BEGIN: Start Indexing the Groups. #
#####################################

# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
  
  ########################################################
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  ########################################################
  
  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%
    
  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%
  
  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    #   NOTE: This keeps the graph clean by trimming extraneous loops, and it
    #   prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]



#############################
# PAUSE: Count the Matches. #
#############################

# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
  
  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%
  
  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%
  
  # Extract the count column.
  .$Matches


# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
  
  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%
  
  # Extract the count column.
  .$Matches



#########################################
# RESUME: Continue Indexing the Groups. #
#########################################

# Resume indexing:
output <- output %>%
  
  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%
  
  ###############################################################
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  ###############################################################
  
  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%
  
  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%
  
  # Pair each ID with its group index.
  igraph::membership() %>%
  
  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
  
  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
  
  
  
  #####################################################
  # STEP 3: Match the Original Rows to their "Chains" #
  #####################################################
  
  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%
  
  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]



# View results.
output

结果

结果output如下data.table

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

请记住,Flag都是FALSE只是因为您的数据缺少任何Start_Date与(至少)End_Date匹配;以及任何End_Date由(至少)两个Start_Date匹配。

假设,如果我们将flag_at降低到1,则outputFlag每一行都有一个单个匹配(两个方向):

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1  TRUE
4:    A 2012-04-16 2015-01-09      1  TRUE
5:    A 2002-06-01 2006-02-01      2  TRUE
6:    A 2005-12-24       <NA>      2  TRUE
7:    B 2018-01-23       <NA>     NA FALSE

警告

由于某些data.tableoperations修改by reference(或原地&q;),my_data_table的值在整个工作流程中都会更改。第1步之后,my_data_table变为

   Name Start_Date   End_Date ID    End_Low   End_High
1:    A 2015-01-01 2019-12-29  1 2018-12-29 2020-12-29
2:    A 2017-03-25       <NA>  2       <NA>       <NA>
3:    A 2019-10-17       <NA>  3       <NA>       <NA>
4:    A 2012-04-16 2015-01-09  4 2014-01-09 2016-01-09
5:    A 2002-06-01 2006-02-01  5 2005-02-01 2007-02-01
6:    A 2005-12-24       <NA>  6       <NA>       <NA>
7:    B 2018-01-23       <NA>  7       <NA>       <NA>

与最初复制的my_data_frame的结构背离。

由于dplyr(在其他包中)是按值而不是按引用赋值,因此dplyr解决方案将完全避开此问题。

但是,您在修改工作流时必须小心,因为在第1步之前可用的my_data_table版本之后无法恢复

进一步阅读

虽然JOINJOIN是明确定向的,但此模型设法保留了您在此处描述的relational symmetry

通过使用undirected graph

JOIN将第一行𝑥(具有)关联到第四行𝑦(具有End_Date)时,我们认为𝑥足够接近𝑦(在1年内)。因此,我们在数学上称之为𝑥ℛ𝑦,或

但是,converse𝑦ℛ𝑥不一定会出现在JOIN数据中,因为的Start_Date可能不会那么方便地落在的End_Date附近。也就是说,JOINed数据不一定指示

在后一种情况下,严格的directed graph(";digraph";)将NOT捕获同一组中𝑥和𝑦的共同成员身份。您可以通过在步骤2的第一行设置directed = TRUE

来观察这种不和谐的差异
  igraph::graph_from_data_frame(directed = TRUE) %>%

并在下一行设置mode = "strong"

  igraph::components(mode = "strong") %>%

要生成这些取消关联的结果:

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      4 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      3 FALSE
4:    A 2012-04-16 2015-01-09      5 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      1 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

相比之下,可以通过使用无向图(directed = FALSE)或通过更宽松的标准(mode = "weak")对行进行正确分组。当JOINed数据中存在𝑦ℛ𝑥时,这两种方法都将有效地模拟𝑥ℛ𝑦的存在。

此对称属性在对您描述的行为建模时尤为重要here

在这种情况下,您希望模型识别任意两行𝑦和𝑧必须位于同一组(𝑦ℛ𝑧)中,只要它们的End_Date与其他行𝑥:𝑦ℛ𝑥和𝑧ℛ𝑥的相同Start_Date匹配。

假设我们知道𝑦ℛ𝑥和𝑧ℛ𝑥。因为我们的模型保留了对称性,我们可以从𝑧ℛ𝑥那里说𝑥ℛ𝑧也是。因为我们现在知道𝑦ℛ𝑥和𝑥ℛ𝑧,transitivity意味着𝑦ℛ𝑧。因此,我们的模型认识到,无论何时𝑦ℛ𝑧𝑦ℛ𝑥和𝑧ℛ𝑥!类似的逻辑适用于";,反之亦然。

我们可以使用

验证此结果
my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))

要在工作流之前将第8行追加到my_data_frame,请执行以下操作:

    Name Start_Date   End_Date
  1    A 2015-01-01 2019-12-29
# ⋮    ⋮      ⋮           ⋮
  4    A 2012-04-16 2015-01-09
# ⋮    ⋮      ⋮           ⋮
  8    A 2010-01-01 2015-01-05
这第8行用作我们的𝑧,其中𝑥是第1行,𝑦是第4行,和以前一样。事实上,output将𝑦和𝑧正确归类为属于同一组1:𝑦ℛ𝑧。

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
8:    A 2010-01-01 2015-01-05      1 FALSE

同样,output正确地Flag是第一行,其Start_Date现在由第4行和第8行的两个End_Date匹配。

干杯!

这篇关于基于模糊标准创建组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-15 17:01