Problem description
Is there a quantitative descriptor of similarity between two words based on how they sound or are pronounced, analogous to Levenshtein distance?
I know soundex gives the same ID to similar-sounding words, but as far as I understand it is not a quantitative descriptor of the difference between the words.
from jellyfish import soundex
print(soundex("two"))
print(soundex("to"))
Recommended answer
You could combine a phonetic encoding with a string comparison algorithm. As a matter of fact, jellyfish supplies both.
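To make the idea concrete before the jellyfish walkthrough, here is a minimal standard-library sketch of the combination: encode both words phonetically, then measure the edit distance between the two codes. The `soundex` and `levenshtein` functions below are simplified stand-ins written for illustration (an assumption, not jellyfish's implementations), and `phonetic_distance` is a hypothetical helper name. In practice the jellyfish functions would take their place.

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels map to "" and reset `prev`
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def phonetic_distance(word_a, word_b, encode=soundex):
    """Hypothetical helper: edit distance between the phonetic codes."""
    return levenshtein(encode(word_a), encode(word_b))


print(phonetic_distance("two", "to"))             # 0 (both encode to T000)
print(phonetic_distance("Catherine", "Kathryn"))  # 1 (C365 vs K365)
```

A distance of 0 means the two words share a code; larger values mean the codes, and presumably the pronunciations, are further apart.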
Setting up the libraries for the example
from jellyfish import soundex, metaphone, nysiis, match_rating_codex,\
levenshtein_distance, damerau_levenshtein_distance, hamming_distance,\
jaro_similarity
from itertools import groupby
import pandas as pd
import numpy as np
dataList = ['two','too','to','fourth','forth','dessert',
'desert','Byrne','Boern','Smith','Smyth','Catherine','Kathryn']
sounds_encoding_methods = [soundex, metaphone, nysiis, match_rating_codex]
Let's compare the different phonetic encodings.
report = pd.DataFrame([dataList]).T
report.columns = ['word']
# Add one column per phonetic encoding
for i in sounds_encoding_methods:
    print(i.__name__)
    report[i.__name__] = report['word'].apply(lambda x: i(x))
print(report)
           soundex  metaphone  nysiis   match_rating_codex
word
two        T000     TW         TW       TW
too        T000     T          T        T
to         T000     T          T        T
fourth     F630     FR0        FART     FRTH
forth      F630     FR0        FART     FRTH
dessert    D263     TSRT       DASAD    DSRT
desert     D263     TSRT       DASAD    DSRT
Byrne      B650     BRN        BYRN     BYRN
Boern      B650     BRN        BARN     BRN
Smith      S530     SM0        SNAT     SMTH
Smyth      S530     SM0        SNYT     SMYTH
Catherine  C365     K0RN       CATARAN  CTHRN
Kathryn    K365     K0RN       CATRYN   KTHRYN
You can see that the phonetic encodings do a pretty good job of making the words comparable. You can look at the different cases and prefer one encoding or another depending on your use case.
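As a small illustration of how the encodings differ case by case, the sketch below checks which encodings assign an identical code to a pair of words. The codes are copied from the table above; `agreeing_encodings` is a hypothetical helper introduced only for this example.

```python
# Codes copied from the table above (a subset of the rows).
codes = {
    "Smith": {"soundex": "S530", "metaphone": "SM0",
              "nysiis": "SNAT", "match_rating_codex": "SMTH"},
    "Smyth": {"soundex": "S530", "metaphone": "SM0",
              "nysiis": "SNYT", "match_rating_codex": "SMYTH"},
    "Catherine": {"soundex": "C365", "metaphone": "K0RN",
                  "nysiis": "CATARAN", "match_rating_codex": "CTHRN"},
    "Kathryn": {"soundex": "K365", "metaphone": "K0RN",
                "nysiis": "CATRYN", "match_rating_codex": "KTHRYN"},
}


def agreeing_encodings(word_a, word_b):
    """List the encodings that give both words exactly the same code."""
    return [name for name, code in codes[word_a].items()
            if codes[word_b][name] == code]


print(agreeing_encodings("Smith", "Smyth"))        # ['soundex', 'metaphone']
print(agreeing_encodings("Catherine", "Kathryn"))  # ['metaphone']
```

For these rows only metaphone maps Catherine and Kathryn to the same code, while soundex differs in just the first letter. This is the kind of difference that a string distance on the codes, rather than an equality check, can still quantify.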
Now I will take the table above and try to find the closest match for each word using levenshtein_distance, but I could use any other comparison function too.
"""Select the closest word by algorithm,
for instance levenshtein_distance"""
report.set_index('word', inplace=True)
report2 = report.copy()
for sounds_encoding in sounds_encoding_methods:
    # Placeholder object column that will hold the closest word
    report2[sounds_encoding.__name__] = None
    for word in dataList:
        closest_list = []
        for word_2 in dataList:
            if word != word_2:
                closest = {}
                closest['word'] = word_2
                # Levenshtein distance between the two codes (lower means closer)
                closest['similarity'] = levenshtein_distance(
                    report.loc[word, sounds_encoding.__name__],
                    report.loc[word_2, sounds_encoding.__name__])
                closest_list.append(closest)
        # Keep the candidate with the smallest distance
        report2.loc[word, sounds_encoding.__name__] = pd.DataFrame(closest_list).\
            sort_values(by='similarity').head(1)['word'].values[0]
print(report2)
           soundex    metaphone  nysiis     match_rating_codex
word
two        too        too        too        too
too        two        to         to         to
to         two        too        too        too
fourth     forth      forth      forth      forth
forth      fourth     fourth     fourth     fourth
dessert    desert     desert     desert     desert
desert     dessert    dessert    dessert    dessert
Byrne      Boern      Boern      Boern      Boern
Boern      Byrne      Byrne      Byrne      Byrne
Smith      Smyth      Smyth      Smyth      Smyth
Smyth      Smith      Smith      Smith      Smith
Catherine  Kathryn    Kathryn    Kathryn    Kathryn
Kathryn    Catherine  Catherine  Catherine  Catherine
As you can see from the above, combining a phonetic encoding with a string comparison algorithm can be very straightforward.
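The closest-match search can also be written without pandas. The following is a minimal sketch, not the answer's original code: `closest_match` and `ratio` are hypothetical helpers, the codes are copied from the metaphone column of the first table, and `difflib.SequenceMatcher` from the standard library stands in for the comparison function so that the snippet runs without extra packages. A function based on `jellyfish.levenshtein_distance` could be passed instead, negated because a lower distance means a closer match.

```python
from difflib import SequenceMatcher

# Metaphone codes copied from the first table (a subset of the rows).
metaphone_codes = {
    "two": "TW", "too": "T", "to": "T",
    "fourth": "FR0", "forth": "FR0",
    "Byrne": "BRN", "Boern": "BRN",
    "Catherine": "K0RN", "Kathryn": "K0RN",
}


def ratio(code_a, code_b):
    """Similarity in [0, 1]; 1.0 means the two codes are identical."""
    return SequenceMatcher(None, code_a, code_b).ratio()


def closest_match(word, codes, similarity=ratio):
    """Return the other word whose code is most similar to `word`'s code."""
    others = [w for w in codes if w != word]
    return max(others, key=lambda w: similarity(codes[word], codes[w]))


print(closest_match("Catherine", metaphone_codes))  # Kathryn
print(closest_match("fourth", metaphone_codes))     # forth
```

Keeping the comparison function as a parameter makes it easy to try several of them on the same codes.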