Problem description
I am trying to match a list of last names against a list of full names using Python 2.7 and the Levenshtein function. To reduce the workload, I only compare two words if their first letters are identical (although this doesn't seem to make much of a difference performance-wise). If a match is found, the matching word is removed from the full name (to make a subsequent first-name match easier). Both lists contain tens of thousands of entries, so my solution is rather slow. How could I speed things up without parsing the full names? Here is what I have so far (I have omitted a few if-conditions for cases where the last names consist of several words):
import Levenshtein

listoflastnames = (['Jones', 'Sallah'])
listoffullnames = (['Henry', 'Jones', 'Junior'], ['Indiana', 'Jones'])

def match_strings(lastname, listofnames):
    match = 0
    matchedidx = []
    for index, nameelement in enumerate(listofnames):
        # Only run the distance check when the first letters agree.
        if lastname[0] == nameelement[0]:
            if Levenshtein.distance(nameelement, lastname) < 2:
                matchedidx.append(index)
                match = match + 1
    if match == 1:
        # Exactly one word matched: return the full name without it.
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
        return 1, newnamelist
    return 0, listofnames

for x in listoflastnames:
    for y in listoffullnames:
        match, newlistofnames = match_strings(x, y)
        if match == 1:
            # go to first name match...
            pass
Any help would be greatly appreciated!
Update: in the meantime I have used the multiprocessing module to let all of my 4 cores handle the issue instead of just one, but the matching still takes a lot of time.
Recommended answer
This simplifies the for loop in the match_strings function, but it didn't increase the speed noticeably in my tests. Most of the time is spent in the two nested for loops over lastnames and fullnames.
def match_strings(lastname, listofnames):
    # Indices refer to positions in listofnames, so the removal below drops the right word.
    matchedidx = [index for index, name in enumerate(listofnames)
                  if lastname[0] == name[0] and Levenshtein.distance(lastname, name) < 2]
    if len(matchedidx) == 1:
        # Exactly one word matched: return the full name without it.
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
        return 1, newnamelist
    return 0, listofnames
You might have to sort the list of known last names and split them into a dict keyed by starting character, then match each name in the list of names only against the last names stored under its starting character.
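The dict idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names (within_one_edit, build_lastname_index, find_lastname_matches) are hypothetical, and within_one_edit is a small pure-Python stand-in for the `Levenshtein.distance(a, b) < 2` test so that the example runs without the Levenshtein package installed.

```python
from collections import defaultdict


def within_one_edit(a, b):
    """Return True if a and b differ by at most one insertion, deletion, or substitution.

    Stand-in (assumption) for `Levenshtein.distance(a, b) < 2`.
    """
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution at position i
    return a[i:] == b[i + 1:]  # b has exactly one extra character at position i


def build_lastname_index(lastnames):
    """Group the known last names by their first character (hypothetical helper)."""
    index = defaultdict(list)
    for lastname in lastnames:
        if lastname:
            index[lastname[0]].append(lastname)
    return index


def find_lastname_matches(fullname, index):
    """Return (position, word, lastname) triples for words close to a known last name."""
    matches = []
    for position, word in enumerate(fullname):
        if not word:
            continue
        # Only last names sharing the word's first character are compared.
        for lastname in index.get(word[0], ()):
            if within_one_edit(word, lastname):
                matches.append((position, word, lastname))
    return matches


listoflastnames = ['Jones', 'Sallah']
listoffullnames = [['Henry', 'Jones', 'Junior'], ['Indiana', 'Jones']]

index = build_lastname_index(listoflastnames)
results = [find_lastname_matches(fullname, index) for fullname in listoffullnames]
```

With this layout the index is built once, and each word of a full name is compared only with the last names in its own bucket instead of with every last name, which is where the nested loops spend their time.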
Assuming the fullnames list always has the first name as its first element, you could limit the comparison to the remaining elements.
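A minimal sketch of that restriction follows, assuming the first element really is always the first name. The function name match_skipping_first and the is_close parameter are hypothetical; exact equality is used as the comparison here only to keep the example free of dependencies.

```python
def match_skipping_first(lastname, fullname, is_close):
    """Variant of match_strings that never compares against fullname[0].

    is_close is any two-argument predicate (hypothetical parameter), for
    example a wrapper around a Levenshtein distance threshold.
    """
    matchedidx = [index for index, word in enumerate(fullname)
                  if index > 0 and lastname and word
                  and word[0] == lastname[0] and is_close(lastname, word)]
    if len(matchedidx) == 1:
        # Exactly one word matched: return the full name without it.
        remaining = [word for index, word in enumerate(fullname)
                     if index != matchedidx[0]]
        return 1, remaining
    return 0, fullname


def exact(a, b):
    """Placeholder comparison (assumption): exact equality instead of edit distance."""
    return a == b


matched, rest = match_skipping_first('Jones', ['Henry', 'Jones', 'Junior'], exact)
```

To keep the question's behaviour, the predicate could be replaced with something like `lambda a, b: Levenshtein.distance(a, b) < 2`.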