Problem description
I have 5000 (sometimes more) street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing each one directly with the other 4999?
Edit: I am also interested in alternative methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.
Recommended answer
I think a better way to group similar addresses would be to:
- Create a database with two tables: one for the addresses (with an id), and one for the soundexes of the words or literal numbers in each address (with a foreign key to the addresses table).
- Uppercase the address and replace anything other than [A-Z] or [0-9] with a space.
- Split the address on spaces, calculate the soundex of each 'word' (leaving anything that consists only of digits as is), and store the results in the soundexes table with the foreign key of the address you started with.
- For each address (with id $target), find the most similar addresses:
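-- Rank other addresses by how many soundex keys they share with $target.
SELECT similar.id, similar.address, COUNT(*) AS shared_words
FROM address similar
JOIN word cmp ON cmp.address_id = similar.id
JOIN word src ON src.soundex = cmp.soundex
WHERE src.address_id = $target
  AND similar.id <> $target          -- do not match the address with itself
GROUP BY similar.id, similar.address -- one row per candidate address
ORDER BY shared_words DESC           -- most shared keys first
LIMIT $some_value;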
- Calculate the Levenshtein distance between your source address and the top few values returned by the query.
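The preprocessing and final comparison steps above can be sketched roughly as follows. This is a Python illustration, not the answerer's code: PHP ships built-in soundex() and levenshtein() functions, whereas here both are written by hand. The names address_keys, build_index and similar_addresses are hypothetical, and an in-memory dictionary stands in for the two database tables. It also counts each shared key once per address, which can differ slightly from the row counts the SQL query produces when a word repeats.

```python
import re
from collections import Counter, defaultdict

# Letter -> digit table for American Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items() for c in letters}


def soundex(word):
    """American Soundex of an uppercase word, padded or cut to 4 characters."""
    code = word[0]
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def address_keys(address):
    """Uppercase, blank out anything but A-Z/0-9, split, encode each token."""
    cleaned = re.sub(r"[^A-Z0-9]", " ", address.upper())
    # Pure-digit tokens are kept as is; everything else becomes a soundex code.
    return [t if t.isdigit() else soundex(t) for t in cleaned.split()]


def levenshtein(a, b):
    """Edit distance using two rows of the dynamic programming table."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (ca != cb)))   # substitution
        prev_row = row
    return prev_row[-1]


def build_index(addresses):
    """Map each key to the set of address positions that contain it."""
    index = defaultdict(set)
    for i, addr in enumerate(addresses):
        for key in address_keys(addr):
            index[key].add(i)
    return index


def similar_addresses(index, addresses, target, limit=3):
    """Pick the candidates sharing the most keys, then sort by edit distance."""
    shared = Counter()
    for key in set(address_keys(addresses[target])):
        for i in index.get(key, ()):
            if i != target:
                shared[i] += 1
    top = [i for i, _ in shared.most_common(limit)]
    return sorted(top, key=lambda i: levenshtein(addresses[target], addresses[i]))


addresses = ["12 Main Street", "12 Mayn Street", "99 Elm Road", "12 Main St."]
index = build_index(addresses)
print(similar_addresses(index, addresses, 0))
```

The point of the design is that the expensive edit distance is only computed for the handful of candidates that already share keys with the target, rather than for every pair.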
(Doing any sort of operation on large arrays is often faster in a database.)
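To make the database side concrete, here is a minimal sketch using Python's sqlite3 module with an in-memory database. The table and column names follow the query in the answer (with the address table spelled out in full), the alias sim and the helper candidates are hypothetical, and the soundex codes are precomputed by hand for four made-up addresses. The index on the soundex column is an assumption about how one would typically set this up; it is what lets the database look up matching keys without scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, address TEXT NOT NULL);
    CREATE TABLE word (
        address_id INTEGER NOT NULL REFERENCES address(id),
        soundex    TEXT    NOT NULL
    );
    CREATE INDEX word_soundex ON word (soundex);
""")

# Sample data: address text plus its precomputed keys
# (digits kept as is, words replaced by their soundex codes).
rows = {
    1: ("12 Main Street", ["12", "M500", "S363"]),
    2: ("12 Mayn Street", ["12", "M500", "S363"]),
    3: ("99 Elm Road",    ["99", "E450", "R300"]),
    4: ("12 Main St.",    ["12", "M500", "S300"]),
}
for address_id, (text, keys) in rows.items():
    conn.execute("INSERT INTO address (id, address) VALUES (?, ?)",
                 (address_id, text))
    conn.executemany("INSERT INTO word (address_id, soundex) VALUES (?, ?)",
                     [(address_id, key) for key in keys])


def candidates(conn, target, limit):
    """Return (id, address, shared key count) rows, most shared keys first."""
    query = """
        SELECT sim.id, sim.address, COUNT(*) AS shared_words
        FROM word src
        JOIN word cmp ON cmp.soundex = src.soundex
        JOIN address sim ON sim.id = cmp.address_id
        WHERE src.address_id = ? AND sim.id <> ?
        GROUP BY sim.id, sim.address
        ORDER BY shared_words DESC, sim.id
        LIMIT ?
    """
    return conn.execute(query, (target, target, limit)).fetchall()


for row in candidates(conn, 1, 5):
    print(row)
```

Bound parameters (the ? placeholders) take the place of $target and $some_value, which also avoids interpolating user-submitted values into the SQL string.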