Question
I am trying to find all the similar-sounding words in a list.
I tried cosine similarity, but that does not serve my purpose.
from sklearn.metrics.pairwise import cosine_similarity
dataList = ['two','fourth','forth','dessert','to','desert']
cosine_similarity(dataList)
I know this is not the right approach. I cannot seem to get a result like:
result = ['xx', 'xx', 'yy', 'yy', 'zz', 'zz']
where identical labels mark words that sound similar.
Recommended answer
First, you need a proper way to find similar-sounding words, i.e. a string similarity based on pronunciation rather than cosine similarity. I would suggest using jellyfish:
from jellyfish import soundex
print(soundex("two"))
print(soundex("to"))
Output:
T000
T000
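To see why "two" and "to" receive the same code, here is a minimal pure-Python sketch of the classic American Soundex rules. The name `simple_soundex` is hypothetical and this is not jellyfish's implementation; it assumes plain ASCII words and may differ from the library on edge cases.

```python
# Digit assigned to each consonant; vowels, y, h and w have no digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        CODES[ch] = digit

def simple_soundex(word):
    """Hypothetical helper: first letter plus up to three digits, zero-padded."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal digits
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        prev = digit  # a vowel resets prev, so a repeated digit after it is kept
    return (encoded + "000")[:4]

for w in ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']:
    print(w, simple_soundex(w))
```

Only the first letter and the following consonant classes matter, so spelling differences in vowels and doubled letters disappear.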
Now, perhaps create a function that handles the list, then sort the result so that matching codes end up next to each other:
def getSoundexList(dList):
    res = [soundex(x) for x in dList]  # encode each element of dList
    # print(res)  # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
    return res
dataList = ['two','fourth','forth','dessert','to','desert']
print(sorted(getSoundexList(dataList)))
Output:
['D263', 'D263', 'F630', 'F630', 'T000', 'T000']
Edit:
Another way could be to use fuzzy:
import fuzzy
soundex = fuzzy.Soundex(4)
print(soundex("to"))
print(soundex("two"))
Output:
T000
T000
Edit 2:
If you want them grouped, you could use groupby:
from itertools import groupby
def getSoundexList(dList):
    return sorted([soundex(x) for x in dList])  # sorted so equal codes are adjacent
dataList = ['two','fourth','forth','dessert','to','desert']
print([list(g) for _, g in groupby(getSoundexList(dataList), lambda x: x)])
Output:
[['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]
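The sort matters here because `itertools.groupby` only merges adjacent equal items. A small sketch with the codes written out by hand (copied from the earlier example, so nothing needs to be installed) shows the difference:

```python
from itertools import groupby

# Soundex codes in the original input order, copied from the example above.
codes = ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']

# Without sorting, equal codes that are not neighbours land in separate groups.
unsorted_groups = [list(g) for _, g in groupby(codes)]
# After sorting, each code forms exactly one group.
sorted_groups = [list(g) for _, g in groupby(sorted(codes))]

print(unsorted_groups)
print(sorted_groups)
```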
Edit 3:
This one is for @Eric Duminil. Let's say you want both the names and their respective values, using dict and itemgetter:
from itertools import groupby
from operator import itemgetter
def getSoundexDict(dList):
    dict_ = {x: soundex(x) for x in dList}  # dict_ with k, v as name/val
    return sorted(dict_.items(), key=itemgetter(1))  # sort the pairs on val
dataList = ['two','fourth','forth','dessert','to','desert']
print([list(g) for _, g in groupby(getSoundexDict(dataList), key=itemgetter(1))])
Output:
[[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]
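If what you want is one label per word in the original order, closer to the `result` list in the question, a dict from code to label may be enough. This is only a sketch: `label_by_code` and the `g0`, `g1`, ... label names are hypothetical, and the codes are copied from the output above so the snippet runs without any library. In practice you would pass jellyfish's `soundex` as `encode`.

```python
def label_by_code(words, encode):
    """Return one label per word, in input order; equal codes share a label."""
    labels = {}   # code -> label, assigned in order of first appearance
    result = []
    for word in words:
        code = encode(word)
        if code not in labels:
            labels[code] = "g{}".format(len(labels))
        result.append(labels[code])
    return result

# Codes copied from the output above (stand-in for a real encoder).
known_codes = {'two': 'T000', 'fourth': 'F630', 'forth': 'F630',
               'dessert': 'D263', 'to': 'T000', 'desert': 'D263'}
dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
labels = label_by_code(dataList, known_codes.get)
print(labels)
```

Unlike the groupby version, this keeps the input order and needs no sorting.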
Edit 4 (for the OP):
Soundex: