Problem description
I have two documents, for example:
Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}
I also know the similarity (correlation) of each pair of words, e.g.:
Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1
What is the best way to measure the similarity of the two documents?
The traditional Jaccard distance and cosine distance do not seem to be good metrics in this situation.
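To make the concern concrete, here is a minimal sketch of plain Jaccard similarity on the two example documents. The names `doc1`, `doc2`, and `jaccard_similarity` are chosen for illustration only. Because Jaccard counts exact matches alone, the known relatedness of pairs such as 'python' and 'pandas' contributes nothing to the score.

```python
doc1 = {"python", "numpy", "machine learning"}
doc2 = {"python", "pandas", "tensorflow", "svm", "regression", "R"}


def jaccard_similarity(a, b):
    """Return |a & b| / |a | b|; only identical words count as a match."""
    union = a | b
    # Two empty sets are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0


# One shared word ('python') out of eight distinct words in total.
print(jaccard_similarity(doc1, doc2))
```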
I like a book by Peter Christen on this issue. In it he describes the Monge-Elkan similarity measure between two sets of strings. For each word in the first set, you find the similarity of its closest word in the second set; you then sum these best-match similarities and divide by the number of elements in the first set. You can see its description on page 30 here.
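A minimal sketch of the Monge-Elkan measure as described above, assuming the pairwise similarities are available through a lookup function. The names `known`, `sim`, and `monge_elkan` are hypothetical, and two choices are assumptions of this sketch rather than part of the question: pairs with no known similarity default to 0.0, and the word similarity is treated as symmetric.

```python
doc1 = {"python", "numpy", "machine learning"}
doc2 = {"python", "pandas", "tensorflow", "svm", "regression", "R"}

# Only the pairs given in the question; a real table would cover more pairs.
known = {
    frozenset(("python", "pandas")): 0.8,
    frozenset(("numpy", "R")): 0.1,
}


def sim(a, b):
    """Word similarity: 1.0 for identical words, 0.0 if unknown (assumption)."""
    if a == b:
        return 1.0
    return known.get(frozenset((a, b)), 0.0)


def monge_elkan(set_a, set_b, sim):
    """Average, over words in set_a, of the best similarity found in set_b."""
    if not set_a or not set_b:
        return 0.0  # assumption: an empty document matches nothing
    best_matches = (max(sim(a, b) for b in set_b) for a in set_a)
    return sum(best_matches) / len(set_a)


print(monge_elkan(doc1, doc2, sim))  # (1 + 0.1 + 0) / 3
print(monge_elkan(doc2, doc1, sim))  # (1 + 0.8 + 0 + 0 + 0 + 0.1) / 6
```

The measure is not symmetric, since it normalizes by the size of the first set. If a symmetric score is needed, one option is to average the two directions.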