Problem description
I am trying to calculate the semantic similarity between two words. I am using WordNet-based similarity measures, i.e. the Resnik measure (RES), the Lin measure (LIN), the Jiang and Conrath measure (JNC), and the Banerjee and Pedersen measure (BNP).
To do that, I am using nltk and WordNet 3.0. Next, I want to combine the similarity values obtained from the different measures. To do that, I need to normalize the similarity values, as some measures give values between 0 and 1, while others give values greater than 1.
So, my question is: how do I normalize the similarity values obtained from different measures?
Extra detail on what I am actually trying to do: I have a set of words. I calculate the pairwise similarity between the words and remove the words that are not strongly correlated with the other words in the set.
Recommended answer
How to normalize a single measure
Let's consider an arbitrary similarity measure M and take an arbitrary word w.
Define m = M(w, w). Then m is the maximum value M can take for w, on the assumption that no word is more similar to w than w itself.
Let's define MN as the normalized version of the measure M.
For any two words w, u you can compute MN(w, u) = M(w, u) / m.
It's easy to see that if M takes non-negative values, then MN takes values in [0, 1].
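The normalization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `normalize` helper and the toy `shared_letters` measure are hypothetical names introduced here, not part of nltk. In practice M would be a wrapper around one of the WordNet measures.

```python
from typing import Callable

# A similarity measure maps a pair of words to a number.
Measure = Callable[[str, str], float]


def normalize(measure: Measure) -> Measure:
    """Return MN, where MN(w, u) = M(w, u) / M(w, w)."""

    def normalized(w: str, u: str) -> float:
        m = measure(w, w)  # self-similarity of w, used as the maximum
        if m <= 0:
            raise ValueError(f"self-similarity of {w!r} must be positive")
        return measure(w, u) / m

    return normalized


# Hypothetical stand-in for a real measure: number of distinct shared letters.
# It is non-negative and never exceeds the self-similarity of either word.
def shared_letters(w: str, u: str) -> float:
    return float(len(set(w) & set(u)))


mn = normalize(shared_letters)
print(mn("cat", "cat"))   # 1.0
print(mn("cat", "dog"))   # 0.0
print(mn("cart", "cat"))  # 0.75
```

Note that with this definition MN(w, u) and MN(u, w) can differ, because each is divided by a different self-similarity.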
To compute your own measure F that combines k different measures m_1, m_2, ..., m_k, first normalize each m_i independently using the method above, and then define weights alpha_1, alpha_2, ..., alpha_k such that alpha_i denotes the weight of the i-th measure.
All alphas must be non-negative and sum to 1, i.e.:
alpha_1 + alpha_2 + ... + alpha_k = 1
Then, to compute your own measure for a pair of words w, u, evaluate the following, where each m_i is the normalized measure:
F(w, u) = alpha_1 * m_1(w, u) + alpha_2 * m_2(w, u) + ... + alpha_k * m_k(w, u)
It's clear that F takes values in [0, 1], since it is a weighted average of values in [0, 1].
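As a rough sketch, the weighted combination could look like this in Python. The `combine` helper and the two toy measures are hypothetical and serve only as stand-ins for already-normalized WordNet measures. The weights used are arbitrary example values.

```python
from typing import Callable, Sequence

Measure = Callable[[str, str], float]


def combine(measures: Sequence[Measure], alphas: Sequence[float]) -> Measure:
    """Return F(w, u) = alpha_1 * m_1(w, u) + ... + alpha_k * m_k(w, u)."""
    if len(measures) != len(alphas):
        raise ValueError("need exactly one weight per measure")
    if any(a < 0 for a in alphas):
        raise ValueError("weights must be non-negative")
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    def combined(w: str, u: str) -> float:
        return sum(a * m(w, u) for a, m in zip(alphas, measures))

    return combined


# Two hypothetical measures that already take values in [0, 1].
def letter_overlap(w: str, u: str) -> float:
    # Fraction of the distinct letters of w that also occur in u.
    return len(set(w) & set(u)) / len(set(w))


def same_length(w: str, u: str) -> float:
    return 1.0 if len(w) == len(u) else 0.0


F = combine([letter_overlap, same_length], [0.75, 0.25])
print(F("cat", "cat"))   # 1.0
print(F("cat", "dog"))   # 0.25
print(F("cart", "cat"))  # 0.5625
```

Checking the weights up front keeps F inside [0, 1] as long as every input measure is itself normalized.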