A faster way to perform fuzzy string matching in pandas

Problem description

Is there any way to speed up fuzzy string matching with fuzzywuzzy in pandas?


I have a dataframe, extra_names, containing names that I want to fuzzy-match against the names in another dataframe, names_df.

>> extra_names.head()

       not_matching
0         Vij Sales
1  Crom Electronics
2       REL Digital
3        Bajaj Elec
4     Reliance Digi

>> len(extra_names)
6500

>> names_df.head()

               names  types
0        Vijay Sales      1
1  Croma Electronics      1
2   Reliance Digital      2
3  Bajaj Electronics      2
4    Pai Electricals      2

>> len(names_df)
250
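
To experiment with the snippets in this thread, the sample rows can be rebuilt as two small frames. This is only a sketch: it uses the five rows shown by head() for each frame, not the full 6500-row and 250-row data, and it assumes pandas is installed.

```python
import pandas as pd

# Rows copied from the head() output above; the real frames are much larger.
extra_names = pd.DataFrame({
    'not_matching': [
        'Vij Sales', 'Crom Electronics', 'REL Digital',
        'Bajaj Elec', 'Reliance Digi',
    ],
})

names_df = pd.DataFrame({
    'names': [
        'Vijay Sales', 'Croma Electronics', 'Reliance Digital',
        'Bajaj Electronics', 'Pai Electricals',
    ],
    'types': [1, 1, 2, 2, 2],
})

print(extra_names)
print(names_df)
```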

As of now, I'm running the logic using the following code, but it's taking forever to complete.

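from fuzzywuzzy import process

choices = names_df['names'].unique().tolist()

def fuzzy_match(name):
    # extractOne returns a (match, score) tuple, or None if no choice qualifies
    best_match = process.extractOne(name, choices)
    # Parenthesize both branches so the function always returns a 2-tuple
    return (best_match[0], best_match[1]) if best_match else ('', '')

%%timeit
# apply() yields a Series of tuples; zip(*...) splits it into two columns
extra_names['best_match'], extra_names['match%'] = zip(
    *extra_names['not_matching'].apply(fuzzy_match))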

As I'm posting this question, the query is still running. Is there any way to speed up this fuzzy string matching process?

Recommended answer

Let's try difflib:

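import difflib
from functools import partial

# Bind the candidate list and ask for only the single closest match
f = partial(
    difflib.get_close_matches, possibilities=names_df['names'].tolist(), n=1)

# get_close_matches returns a list; take its first element, or '' if it is empty
matches = extra_names['not_matching'].map(f).str[0].fillna('')
# Similarity ratio in [0, 1] between each name and its chosen match
scores = [
    difflib.SequenceMatcher(None, x, y).ratio()
    for x, y in zip(matches, extra_names['not_matching'])
]

extra_names.assign(best=matches, score=scores)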

       not_matching               best     score
0         Vij Sales        Vijay Sales  0.900000
1  Crom Electronics  Croma Electronics  0.969697
2       REL Digital   Reliance Digital  0.666667
3        Bajaj Elec  Bajaj Electronics  0.740741
4     Reliance Digi   Reliance Digital  0.896552
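
To make the mechanics of the answer easier to see, here is a minimal standard-library sketch of the same idea for a single name, without pandas. The helper name best_match and the placeholder input 'zzzz' are illustrative assumptions, not part of the original answer. It also shows why the answer needs fillna(''): get_close_matches drops candidates whose ratio falls below its cutoff argument (0.6 by default), so a name can end up with no match at all.

```python
import difflib

# Candidate names taken from the names_df sample in the question.
choices = [
    'Vijay Sales', 'Croma Electronics', 'Reliance Digital',
    'Bajaj Electronics', 'Pai Electricals',
]

def best_match(name, choices, cutoff=0.6):
    """Return (closest_choice, ratio), or ('', 0.0) if nothing reaches cutoff.

    Hypothetical helper for illustration only.
    """
    found = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    if not found:
        # No candidate scored at least `cutoff`
        return '', 0.0
    # Same argument order as the answer: (match, original name)
    score = difflib.SequenceMatcher(None, found[0], name).ratio()
    return found[0], score

for name in ['Vij Sales', 'Crom Electronics', 'zzzz']:
    print(name, '->', best_match(name, choices))
```

Lowering cutoff accepts weaker matches, while raising it leaves more names unmatched. Which value is appropriate depends on the data.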

