Problem description
I am trying to measure the similarity between tokens. I am using the default en model. The similarity measure works as expected when using singular nouns but returns zero when using the same nouns in plural.
import spacy

nlp = spacy.load('en')
doc = nlp('apple orange')
doc[0].similarity(doc[1])
returns 0.56189166448170025
doc = nlp('apples oranges')
doc[0].similarity(doc[1])
returns 0.0
Are there any preprocessing steps I need to implement for the measure to work correctly? Thanks.
Recommended answer
I don't think it supports phrasal similarity; a hacky alternative is to tokenize your phrase and score it as the average of the similarities of its tokens. Alternatively, you can use the phrasal similarity here.
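As a rough sketch of the averaging idea (one possible reading of it, not necessarily what the answer had in mind): take every pair of tokens across the two phrases and average their similarities. The helper names `cosine` and `average_token_similarity` and the toy vectors are made up for illustration; nothing here comes from spaCy itself.

```python
from itertools import product
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_token_similarity(tokens_a, tokens_b, similarity):
    """Mean of similarity(a, b) over every cross-phrase pair of tokens."""
    pairs = list(product(tokens_a, tokens_b))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Toy 2-d vectors, invented for this example only (not real word vectors).
toy_vectors = {
    "red": [1.0, 0.0],
    "apple": [1.0, 1.0],
    "orange": [0.0, 1.0],
}


def toy_similarity(a, b):
    """Hypothetical stand-in for a real token similarity function."""
    return cosine(toy_vectors[a], toy_vectors[b])


score = average_token_similarity(["red", "apple"], ["orange"], toy_similarity)
print(score)
```

With spaCy, the same helper could presumably be called on the tokens of two parsed docs, passing something like `lambda a, b: a.similarity(b)` as the similarity function. Note that a token with no vector would still contribute 0.0 to the average, as in the plural example from the question.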