Problem description
I have two tables that I need to merge together in PostgreSQL on the common variable "company name". Unfortunately, many of the company names don't match exactly (e.g. MICROSOFT in one table, MICROSFT in the other). I've tried removing common words such as "corporation", "inc" or "ltd" from both columns to standardize names across the two tables, but I'm having trouble thinking of additional strategies. Any ideas?
Thanks.
Also, if necessary I can do this in R.
Solution
Have you considered the fuzzystrmatch module? You can use soundex, difference, levenshtein, metaphone and dmetaphone, or a combination.
SELECT something
FROM somewhere
WHERE levenshtein(item1, item2) < Carefully_Selected_Threshold
For example, the levenshtein distance from MICROSOFT to MICROSFT is one (1).
SELECT levenshtein(dmetaphone('MICROSOFT'), dmetaphone('MICROSFT'));
The above returns zero (0). Combining levenshtein and dmetaphone could help you match lots of misspellings.
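To make the combination concrete, here is a rough sketch of how the two tables might be joined. The table names table_a and table_b, the column company_name, and the cutoff of 2 are all placeholders I made up for illustration, not anything from the question. Requiring equal dmetaphone codes is the same as requiring a levenshtein distance of zero between the codes, so that condition does the phonetic filtering and the WHERE clause then bounds the raw edit distance.

```sql
-- Enable the module once per database (needs sufficient privileges).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- table_a, table_b and company_name are hypothetical names.
SELECT a.company_name AS name_a,
       b.company_name AS name_b,
       levenshtein(upper(a.company_name), upper(b.company_name)) AS edit_distance
FROM table_a AS a
JOIN table_b AS b
  -- Phonetic filter: keep only pairs whose Double Metaphone codes agree.
  ON dmetaphone(a.company_name) = dmetaphone(b.company_name)
-- Edit-distance cutoff; 2 is an arbitrary starting point to tune on your data.
WHERE levenshtein(upper(a.company_name), upper(b.company_name)) <= 2
ORDER BY edit_distance, name_a;
```

A name in one table can pair with several candidates in the other, so you will probably want to review the output (or keep only the lowest edit_distance per name) before treating it as the final merge key.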