Problem description
I calculated the tf/idf values of two documents. These are the values:
1.txt
0.0
0.5
2.txt
0.0
0.5
The documents are as follows:
1.txt => dog cat
2.txt => cat elephant
How can I use these values to calculate the cosine similarity?
I know that I should calculate the dot product, then find the distance and divide the dot product by it. How can I do this with my values?
One more question: does it matter whether the two documents have the same number of words?
Recommended answer
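$$\mathrm{sim}(a, b) = \frac{a \cdot b}{|a| \, |b|}$$
Here a · b is the dot product, and |a| and |b| are the lengths (Euclidean norms) of the two vectors.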
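As a worked illustration, the formula can be applied to the two documents from the question. This is only a sketch under an assumption: the listing shows the weights but not which term each one belongs to, so it is assumed here that 0.0 belongs to the shared term cat and 0.5 to each document's unique term (dog, elephant). With the vocabulary ordered (cat, dog, elephant), the vectors are a = (0, 0.5, 0) and b = (0, 0, 0.5):

```latex
\[
a \cdot b = 0 \cdot 0 + 0.5 \cdot 0 + 0 \cdot 0.5 = 0,
\qquad
|a| = \sqrt{0^2 + 0.5^2 + 0^2} = 0.5,
\qquad
|b| = \sqrt{0^2 + 0^2 + 0.5^2} = 0.5
\]
\[
\mathrm{sim}(a, b) = \frac{a \cdot b}{|a| \, |b|} = \frac{0}{0.5 \cdot 0.5} = 0
\]
```

Under that assumption the similarity is 0: the only term the documents share, cat, carries zero weight in both.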
Some details:
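    import math

    def dot(a, b):
        # Dot product: sum of element-wise products of two equal-length vectors.
        total = 0.0
        for i in range(len(a)):
            total += a[i] * b[i]
        return total

    def norm(a):
        # Euclidean length: square root of the sum of squared entries.
        total = 0.0
        for i in range(len(a)):
            total += a[i] * a[i]
        return math.sqrt(total)

    def cossim(a, b):
        # Cosine similarity; raises ZeroDivisionError if either vector is all zeros.
        return dot(a, b) / (norm(a) * norm(b))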
Yes, to some extent: a and b must have the same length. In practice, though, a and b usually have a sparse representation, so you only need to store the non-zero entries, and norm and dot can then be computed faster.
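The sparse idea could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. It assumes each document is stored as a dict mapping a term to its tf/idf weight, with zero entries left out. The names `sparse_dot`, `sparse_norm`, and `sparse_cossim` are hypothetical, and the pairing of terms with weights for the question's documents is an assumption, since the listing shows only the numbers.

```python
import math

def sparse_dot(a, b):
    # Only terms present in both dicts contribute, so walk the smaller one.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

def sparse_norm(a):
    # Missing terms are zeros and add nothing to the sum of squares.
    return math.sqrt(sum(w * w for w in a.values()))

def sparse_cossim(a, b):
    denom = sparse_norm(a) * sparse_norm(b)
    # Convention chosen for this sketch (an assumption): an all-zero
    # vector yields 0.0 instead of a division-by-zero error.
    return sparse_dot(a, b) / denom if denom else 0.0

# Assumed pairing: "cat" has weight 0.0 in both documents and is omitted.
doc1 = {"dog": 0.5}
doc2 = {"elephant": 0.5}
print(sparse_cossim(doc1, doc2))  # 0.0
```

Because the two dicts share no stored term, the dot product is 0 and so is the similarity. The dicts also need not be the same size; terms that are not stored are simply treated as zeros.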