I am trying to find the best precision setting for fuzzy string matching between company names using agrep.

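However, because the number of strings is huge, I need to pick a single "max.distance" value and apply it to every string I want to match.
Choosing the best "max.distance" separately for each string is not feasible.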

For example, suppose I try "max.distance" values of 0.2, 0.1, and 0.05 for both "BANK OF AMERICA CORP" and "1st Capital Bank".

First, here are the results for "BANK OF AMERICA CORP" with "max.distance" set to 0.2, 0.1, and 0.05:

    > agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.2)
     [1] "BANK OF AMERICA/PRIVATE BANK WEST"   "BANK OF AMERICA SECURITIES"
     [3] "BANK OF AMERICA SEC LLC"             "BANK OF AMERICA SECURITIES LLC"
     [5] "BANK OF AMERICA NT & SA"             "BANK OF AMERICA CORP"
     [7] "ALLIANZ OF AMERICA CORP"             "Bank of America Securities/Vice Pre"
     [9] "Bank of America Securities/Investme" "Bank of America/President"
    [11] "Bank of America Securities LLC/Prin" "Bank of America Securities LLC/Mana"
    [13] "Bank of America Securities LLC/Inve" "Bank of America Securities/Principa"
    [15] "Bank of America Securities LLC/Bank" "Bank of America Sec/Investment Bank"
    [17] "Bank Of America Securities/Managing" "Bank of America/Chairman--Midwest A"
    [19] "Bank of America Securities LLC/Vice" "Bank of America Corporation/Sales C"
    [21] "Bank of America Securities/Broker"   "Bank of America Corporation/Banker"
    [23] "Bank of America Corporation/Senior"  "Bank of America Securities/Equity R"
    [25] "Bank of America Corporation/Vice Ch" "BANK OF AMERICA CORPORATION"
    [27] "BANK OF AMERICA HEADQUARTERS"        "BANK OF AMERICA ADMINISTRATION"
    [29] "BANK OF AMERICA N A"                 "Bank of America/Commercial Banking"
    [31] "Bank of America Sec./Investment Ban"
    >
    > agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.1)
    [1] "BANK OF AMERICA CORP"                "ALLIANZ OF AMERICA CORP"
    [3] "Bank of America Corporation/Sales C" "Bank of America Corporation/Banker"
    [5] "Bank of America Corporation/Senior"  "Bank of America Corporation/Vice Ch"
    [7] "BANK OF AMERICA CORPORATION"
    >
    > agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.05)
    [1] "BANK OF AMERICA CORP"                "Bank of America Corporation/Sales C"
    [3] "Bank of America Corporation/Banker"  "Bank of America Corporation/Senior"
    [5] "Bank of America Corporation/Vice Ch" "BANK OF AMERICA CORPORATION"

Then, here are the results for "1st Capital Bank" with "max.distance" set to 0.2, 0.1, and 0.05:

    > agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.2)
      [1] "HURST CAPITAL PARTNERS"
      [2] "SOY CAPITAL BANK"
      [3] "FIRST CAPITOL BANK OF VICTOR"
      [4] "OSTERWEIS CAPITAL MANAGEMENT"
      [5] "1ST NATIONAL BANK"
      [6] "FIRST CAPITAL BANK"
      [7] "SEATTLE 1ST NAT'L BANK"
      [8] "FIELD POINT CAPITAL MANAGEMENT"
      [9] "SUMMERSET CAPITAL MANAGEMENT"
     [10] "AMERIQUEST CAPITAL ASSOC"
     [11] "BB&T CAPITAL MARKETS"
     [12] "HUGHES CAPITAL MANAGEMENT"
     [13] "WELLS CAPITAL MANAGEMENT"
     [14] "SUPERIOR ST CAPITAL ADVISORS"
     [15] "ORMES CAPITAL MARKETS INC"
     [16] "1ST NAT'L BANK OF IL"
     [17] "ADVENT CAPITAL MANAGEMENT"
     [18] "1ST CAPITOL BANK"
     [19] "BIONDI REISS CAPITAL MANAGEMENT"
     [20] "CCYBYS CAPITAL MARKETS"
     [21] "SEACOAST CAPITAL PARTNERS"
     [22] "DOUGLAS CAPITAL MANAGEMENT"
     [23] "HIGHFIELDS CAPITAL MANAGEMENT"
     [24] "PRECEPT CAPITAL MANAGEMENT LP"
     [25] "AUGUST CAPITAL MANAGEMENT"
     [26] "SAKSA CAPITAL MANAGEMENT"
     [27] "IMS CAPITAL MANAGEMENT"
     [28] "TRENT CAPITAL MANAGEMENT"
     [29] "Ormes Capital Management"
     [30] "GARNET CAPITAL MANAGEMENT LLC"
     [31] "INTERFASE CAPITAL MANAGERS"
     [32] "RJS CAPITAL MANAGEMENT INC"
     [33] "1ST NATIONAL BANK OF DE KALB"
     [34] "1ST NAT'L BANK OF PHILLIPS CO"
     [35] "1ST NAT'L BANK OF OKLAHOMA"
     [36] "PROGRESS CAPITAL MANAGEMENT INC"
     [37] "CAPITAL BANK & TRUST"
     [38] "1ST NATL BANK"
     [39] "ASB Capital Management/Real Estate"
     [40] "Sears Capital Management"
     [41] "Osterweis Capital Management/Invest"
     [42] "Cerberus Capital Management LP/Asse"
     [43] "LVS Capital Management/President"
     [44] "1st Central Bank/Banker"
     [45] "Summit Capital Management"
     [46] "Orwes Capital Markets/Stockbroker"
     [47] "Ormes Capital Management/Investment"
     [48] "Nevis Capital Management/Investment"
     [49] "Duncan Hurst Capital Management"
     [50] "Progress Capital Management/Preside"
     [51] "Cerberus Capital Management LP"
     [52] "Wit Capital/Banker"
     [53] "Ormes Capital Markets Inc."
     [54] "Ormes Capital Markets/President & C"
     [55] "Berents & Hess Capital Management"
     [56] "Progress Capital Management/Venture"
     [57] "First Capital Bank of KY"
     [58] "Foothill Capital/Banker"
     [59] "Pequot Capital Management/Equity Re"
     [60] "First Dominion Capital/Banking"
     [61] "Greenwhich Capital/Banker"
     [62] "Veritas Capital Management/Banker"
     [63] "Veritas Capital Management/Investme"
     [64] "Lesese Capital Management/Investmen"
     [65] "Douglas Capital Management/Investme"
     [66] "FIRST NATINAL BANK OF AMARILLO"
     [67] "NEVIS CAPITAL MANAGEMENT"
     [68] "VERITAS CAPITAL MANAGEMENT"
     [69] "SIEBERT CAPITAL MARKETS"
     [70] "HOURGLASS CAPITAL MANAGEMENT"
     [71] "1ST NATIONAL BANK DALHART"
     [72] "TEXAS CAPITAL BANK"
     [73] "NICHOLAS CAPITAL MANAGEMENT"
     [74] "CERBUS CAPITAL MANAGEMENT"
     [75] "CROESUS CAPITAL MANAGEMENT"
     [76] "EAST WEST CAPITAL ASSOCIATES INC"
     [77] "PRENDERGAST CAPITAL MANAGEMENT"
     [78] "NANTUCKET CAPITAL MANAGEMENT"
     [79] "1ST NATIONAL BANK TEMPLE"
     [80] "ENTRUST CAPITAL INC"
     [81] "1ST NATIONAL BANK OF IL"
     [82] "SIMMS CAPITAL MANAGEMENT"
     [83] "FIRST CAPITAL ADVISORS"
     [84] "FIRST CAPITAL MANAGEMENT LTD"
     [85] "1ST NATIONAL BANK & TRUST"
     [86] "PENTECOST CAPITAL MANAGEMENT INC"
     [87] "EAST-WEST CAPITAL ASSOCIATES"
     [88] "1ST NAT'L BANK OF JOLIET"
     [89] "FIRST CAPITOL BANK OF VICTO"
     [90] "FIRST CAPITAL FINANCIAL"
     [91] "PACIFIC COAST CAPITAL PARTNERS"
     [92] "FIRST CAPITOL BANK"
     [93] "FIRST CAPITAL ENGINEERING"
     [94] "MIDWEST CAPITOL MANAGEMENT"
     [95] "PEQUOT CAPITAL MANAGEMENT"
     [96] "AGGOTT CAPITAL MANAGEMENT"
     [97] "SIMMS CAPITAL MANAGEMENT INC"
     [98] "PHILLIPS CAPITAL MANAGEMENT LLC"
     [99] "1ST NATIONAL BANK OF COLD SP"
    [100] "SOY CAPITOL BANK"
    >
    > agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.1)
    [1] "FIRST CAPITOL BANK OF VICTOR" "FIRST CAPITAL BANK"
    [3] "1ST CAPITOL BANK"             "First Capital Bank of KY"
    [5] "TEXAS CAPITAL BANK"           "FIRST CAPITOL BANK OF VICTO"
    [7] "FIRST CAPITOL BANK"
    >
    > agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.05)
    [1] "FIRST CAPITAL BANK"       "1ST CAPITOL BANK"
    [3] "First Capital Bank of KY"

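As you can see, it is really hard to find one "max.distance" value that works for every string, such as "BANK OF AMERICA CORP" and "1st Capital Bank". I also have many more company names besides these two, which is why I am struggling to find a general precision value and command for fuzzy string matching.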

The original C1999_0 data file is too large to attach, so I think the output shown above should be enough to reproduce the problem.

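I know there are several sub-parameters that can be adjusted, such as cost, substitutions, and insertions, but tuning them makes little difference compared with simply changing the "max.distance" value itself.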
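For reference, the sub-parameters mentioned above are passed to agrep as a list in "max.distance" (components such as `cost`, `all`, `insertions`, `deletions`, and `substitutions` are described in `?agrep`). The following is only a minimal sketch: the `candidates` vector is a small sample copied from the output above, and the particular bounds are arbitrary assumptions chosen for illustration.

```r
# Small sample of names taken from the output above (not the full C1999_0 data)
candidates <- c("BANK OF AMERICA CORP", "ALLIANZ OF AMERICA CORP",
                "BANK OF AMERICA CORPORATION", "BANK OF AMERICA N A")

# Assumed bounds for illustration: at most 2 edits overall, no substitutions
matches <- agrep("BANK OF AMERICA CORP", candidates,
                 ignore.case = TRUE, value = TRUE, fixed = TRUE,
                 max.distance = list(all = 2, substitutions = 0))
print(matches)
```

Restricting one kind of edit in this way changes which near-misses get through, but it still applies one global rule to every pattern.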

I would really appreciate any help with this!

Best answer

As described, this looks like an unsolvable problem: there is no single max distance that will work well for all input strings.
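A small sketch can make this concrete. According to `?agrep`, a fractional "max.distance" is a fraction of the pattern length times the maximal transformation cost, rounded up to an integer. Assuming that rule and the default costs of 1, the edit budget for the two patterns in the question would be:

```r
patterns  <- c("BANK OF AMERICA CORP", "1st Capital Bank")
fractions <- c(0.2, 0.1, 0.05)

# Assumed rule (from ?agrep): budget = ceiling(fraction * pattern length)
budget <- sapply(patterns, function(p) ceiling(fractions * nchar(p)))
rownames(budget) <- fractions
print(budget)
#      BANK OF AMERICA CORP 1st Capital Bank
# 0.2                     4                4
# 0.1                     2                2
# 0.05                    1                1
```

Under that assumption both patterns receive the same budget of 4, 2, and 1 edits, yet their match lists differ greatly. The difference seems to come from how common the words in each pattern are (agrep looks for the pattern within each candidate string, so generic words like "Capital" and "Bank" attract many partial matches), not from the budget itself.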

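It might be worthwhile to use something like tf-idf to measure how unusual a string is, and to scale the max distance accordingly. That way "Ziggurat Mutual" could be given more room for changes than "First Bank National", which is more generic.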
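The idea can be sketched in base R as follows. This is not a full tf-idf implementation: it uses only the inverse document frequency of each word, the `corpus` vector is a toy stand-in for the real name column, and `tokenize()` and `scaled_max_distance()` are hypothetical helpers with arbitrarily chosen bounds of 0.05 and 0.2.

```r
# Toy corpus standing in for the real name column (assumption)
corpus <- c("FIRST CAPITAL BANK", "1ST CAPITOL BANK", "TEXAS CAPITAL BANK",
            "FIRST NATIONAL BANK", "WELLS CAPITAL MANAGEMENT",
            "ZIGGURAT MUTUAL")

# Hypothetical helper: split a name into unique upper-case word tokens
tokenize <- function(x) {
  tokens <- strsplit(toupper(x), "[^A-Z0-9]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

# Inverse document frequency of every token seen in the corpus
doc_freq <- table(unlist(lapply(corpus, tokenize)))
idf <- setNames(as.numeric(log(length(corpus) / doc_freq)), names(doc_freq))

# Hypothetical helper: map the mean IDF of a pattern's tokens onto a
# max.distance fraction between `lo` (generic name) and `hi` (rare name)
scaled_max_distance <- function(pattern, lo = 0.05, hi = 0.2) {
  scores <- unname(idf[tokenize(pattern)])
  scores[is.na(scores)] <- max(idf)  # unseen tokens count as maximally rare
  lo + (hi - lo) * mean(scores) / max(idf)
}

for (pattern in c("Ziggurat Mutual", "First Bank National")) {
  cat(pattern, "->", round(scaled_max_distance(pattern), 3), "\n")
}
```

The resulting value could then be passed per pattern, for example `agrep(pattern, corpus, max.distance = scaled_max_distance(pattern), ignore.case = TRUE, value = TRUE, fixed = TRUE)`. How to map rarity onto a distance is a design choice; a linear mapping is used here only because it is the simplest.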

You might also consider the fuzzyjoin package, which offers some quick ways to try different options. For example, you could try:

    # Sample names copied from the question's output
    df <- c("HURST CAPITAL PARTNERS", "SOY CAPITAL BANK",
            "FIRST CAPITOL BANK OF VICTOR", "OSTERWEIS CAPITAL MANAGEMENT",
            "1ST NATIONAL BANK", "FIRST CAPITAL BANK", "SEATTLE 1ST NAT'L BANK",
            "FIELD POINT CAPITAL MANAGEMENT", "SUMMERSET CAPITAL MANAGEMENT",
            "AMERIQUEST CAPITAL ASSOC", "BB&T CAPITAL MARKETS",
            "HUGHES CAPITAL MANAGEMENT", "WELLS CAPITAL MANAGEMENT",
            "SUPERIOR ST CAPITAL ADVISORS", "ORMES CAPITAL MARKETS INC",
            "1ST NAT'L BANK OF IL", "ADVENT CAPITAL MANAGEMENT", "1ST CAPITOL BANK",
            "BIONDI REISS CAPITAL MANAGEMENT", "CCYBYS CAPITAL MARKETS",
            "SEACOAST CAPITAL PARTNERS", "DOUGLAS CAPITAL MANAGEMENT",
            "HIGHFIELDS CAPITAL MANAGEMENT", "PRECEPT CAPITAL MANAGEMENT LP",
            "AUGUST CAPITAL MANAGEMENT", "SAKSA CAPITAL MANAGEMENT",
            "IMS CAPITAL MANAGEMENT", "TRENT CAPITAL MANAGEMENT",
            "Ormes Capital Management", "GARNET CAPITAL MANAGEMENT LLC",
            "INTERFASE CAPITAL MANAGERS", "RJS CAPITAL MANAGEMENT INC",
            "1ST NATIONAL BANK OF DE KALB", "1ST NAT'L BANK OF PHILLIPS CO",
            "1ST NAT'L BANK OF OKLAHOMA", "PROGRESS CAPITAL MANAGEMENT INC",
            "CAPITAL BANK & TRUST", "1ST NATL BANK",
            "ASB Capital Management/Real Estate", "Sears Capital Management",
            "Osterweis Capital Management/Invest",
            "Cerberus Capital Management LP/Asse",
            "LVS Capital Management/President", "1st Central Bank/Banker",
            "Summit Capital Management", "Orwes Capital Markets/Stockbroker",
            "Ormes Capital Management/Investment",
            "Nevis Capital Management/Investment",
            "Duncan Hurst Capital Management",
            "Progress Capital Management/Preside",
            "Cerberus Capital Management LP", "Wit Capital/Banker",
            "Ormes Capital Markets Inc.", "Ormes Capital Markets/President & C",
            "Berents & Hess Capital Management",
            "Progress Capital Management/Venture", "First Capital Bank of KY",
            "Foothill Capital/Banker", "Pequot Capital Management/Equity Re",
            "First Dominion Capital/Banking", "Greenwhich Capital/Banker",
            "Veritas Capital Management/Banker",
            "Veritas Capital Management/Investme",
            "Lesese Capital Management/Investmen",
            "Douglas Capital Management/Investme",
            "FIRST NATINAL BANK OF AMARILLO", "NEVIS CAPITAL MANAGEMENT",
            "VERITAS CAPITAL MANAGEMENT", "SIEBERT CAPITAL MARKETS",
            "HOURGLASS CAPITAL MANAGEMENT", "1ST NATIONAL BANK DALHART",
            "TEXAS CAPITAL BANK", "NICHOLAS CAPITAL MANAGEMENT",
            "CERBUS CAPITAL MANAGEMENT", "CROESUS CAPITAL MANAGEMENT",
            "EAST WEST CAPITAL ASSOCIATES INC", "PRENDERGAST CAPITAL MANAGEMENT",
            "NANTUCKET CAPITAL MANAGEMENT", "1ST NATIONAL BANK TEMPLE",
            "ENTRUST CAPITAL INC", "1ST NATIONAL BANK OF IL",
            "SIMMS CAPITAL MANAGEMENT", "FIRST CAPITAL ADVISORS",
            "FIRST CAPITAL MANAGEMENT LTD", "1ST NATIONAL BANK & TRUST",
            "PENTECOST CAPITAL MANAGEMENT INC", "EAST-WEST CAPITAL ASSOCIATES",
            "1ST NAT'L BANK OF JOLIET", "FIRST CAPITOL BANK OF VICTO",
            "FIRST CAPITAL FINANCIAL", "PACIFIC COAST CAPITAL PARTNERS",
            "FIRST CAPITOL BANK", "FIRST CAPITAL ENGINEERING",
            "MIDWEST CAPITOL MANAGEMENT", "PEQUOT CAPITAL MANAGEMENT",
            "AGGOTT CAPITAL MANAGEMENT", "SIMMS CAPITAL MANAGEMENT INC",
            "PHILLIPS CAPITAL MANAGEMENT LLC", "1ST NATIONAL BANK OF COLD SP",
            "SOY CAPITOL BANK")

    library(dplyr); library(fuzzyjoin)
    # One-column tibble named "value" (as_data_frame() is deprecated)
    df <- tibble(value = df)

    df %>%
      # Allowable methods include osa, lv, dl, hamming, lcs, qgram,
      #    cosine, jaccard, jw, soundex
      fuzzyjoin::stringdist_inner_join(df, method = "lv", distance_col = "distance", max_dist = 4) %>%
      # Drop each name's exact match with itself
      filter(distance > 0)

    Joining by: "value"
    # A tibble: 70 x 3
       value.x                      value.y                     distance
       <chr>                        <chr>                          <dbl>
     1 SOY CAPITAL BANK             1ST CAPITOL BANK                   4
     2 SOY CAPITAL BANK             SOY CAPITOL BANK                   1
     3 FIRST CAPITOL BANK OF VICTOR FIRST CAPITOL BANK OF VICTO        1
     4 1ST NATIONAL BANK            1ST NATL BANK                      4
     5 FIRST CAPITAL BANK           1ST CAPITOL BANK                   4
     6 FIRST CAPITAL BANK           FIRST CAPITOL BANK                 1
     7 HUGHES CAPITAL MANAGEMENT    DOUGLAS CAPITAL MANAGEMENT         4
     8 HUGHES CAPITAL MANAGEMENT    AUGUST CAPITAL MANAGEMENT          4
     9 WELLS CAPITAL MANAGEMENT     IMS CAPITAL MANAGEMENT             4
    10 WELLS CAPITAL MANAGEMENT     NEVIS CAPITAL MANAGEMENT           3

...to explore potential imperfect matches within your example list.
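Since the join accepts other methods, a length-independent metric is one option to try. The sketch below swaps in `method = "jw"` (Jaro-Winkler, whose distance lies between 0 and 1) on three names from the list; the `max_dist = 0.1` cutoff is an arbitrary assumption, not a recommended value, and the dplyr and fuzzyjoin packages are assumed to be installed.

```r
library(dplyr)
library(fuzzyjoin)

# Three names from the example list, in a one-column tibble named "value"
names_df <- tibble(value = c("SOY CAPITAL BANK", "SOY CAPITOL BANK",
                             "WELLS CAPITAL MANAGEMENT"))

close_pairs <- names_df %>%
  # Self-join on "value"; the 0.1 cutoff is an assumption for illustration
  stringdist_inner_join(names_df, by = "value", method = "jw",
                        max_dist = 0.1, distance_col = "distance") %>%
  # Drop each name's exact match with itself
  filter(distance > 0)

print(close_pairs)
```

Because the Jaro-Winkler distance is already normalized, the same cutoff means roughly the same thing for short and long names, though it does not by itself address the problem of generic words.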

Regarding "r - How to use agrep to get an accurate, universal "max.distance" value for fuzzy string matching?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52273813/
