Problem description
I'm currently using difflib's get_close_matches to iterate through a list of 15,000 strings and find the closest match for each against another list of approximately 15,000 strings:
import difflib

a = ['blah', 'pie', 'apple'...]
b = ['jimbo', 'zomg', 'pie'...]
for value in a:
    matches = difflib.get_close_matches(value, b, n=1, cutoff=0.85)
It takes 0.58 seconds per value, which means the full loop will take 8,714 seconds, or about 145 minutes. Is there another library/method that might be faster, or a way to improve the speed of this approach? I've already tried converting both lists to lower case, but that only produced a slight speedup.
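Before switching libraries, one stdlib-only speedup is to spread the loop across CPU cores. A minimal sketch, assuming the lists a and b from the question; the helper names and the process count are illustrative, not part of the original code:

```python
import difflib
from functools import partial
from multiprocessing import Pool

def closest(value, candidates):
    """Return the single best match for value in candidates, or None."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=0.85)
    return matches[0] if matches else None

def match_all(a, b, processes=4):
    """Match every string in a against b, one worker per chunk of a."""
    with Pool(processes) as pool:
        return pool.map(partial(closest, candidates=b), a)

if __name__ == "__main__":
    a = ['blah', 'pie', 'apple']
    b = ['jimbo', 'zomg', 'pie']
    print(match_all(a, b))
```

This keeps difflib's per-pair cost the same but divides the wall-clock time by roughly the number of cores; it does not change the underlying O(N) scan per query.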
Recommended answer
fuzzyset indexes strings by their bigrams and trigrams, so it finds approximate matches in O(log(N)) rather than difflib's O(N). For my fuzzyset of 1M+ words and word pairs, it computes the index in about 20 seconds and finds the closest match in under 100 ms.
That concludes this Q&A on better fuzzy-matching performance; we hope the recommended answer is helpful.