Problem description
I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf values for some documents, but when I try to calculate the cosine similarity between two of them I get a traceback:
#len(u)==201, len(v)==246
cosine_distance(u, v)
ValueError: objects are not aligned
#this works though:
cosine_distance(u[:200], v[:200])
>> 0.52230249969265641
Is slicing the vectors so that len(u)==len(v) the right approach? I would have thought that cosine similarity works with vectors of different lengths.
I'm using:
import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
Also, is the order of the tf_idf values in the vectors important? Should they be sorted, or does it not matter for this calculation?
Recommended answer
Are you computing the cosine similarity of term vectors? Term vectors should be the same length. If a term doesn't appear in a document, that document's vector should hold a value of 0 for that term.
I'm not exactly sure which vectors you're applying cosine similarity to, but the vectors must always be the same length, and order very much matters: each position has to refer to the same term in both vectors.
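One way to get vectors of equal length and consistent order is to build both from a shared vocabulary. The following is a minimal sketch in plain Python, assuming each document's tf_idf scores are stored as a term-to-weight dict; the helper names `align` and `cosine_similarity` and the sample weights are hypothetical, not taken from the question.

```python
import math

def align(doc_a, doc_b):
    """Build two equal-length vectors over the union of both documents' terms."""
    # Sorting fixes one term order that both vectors share.
    vocab = sorted(set(doc_a) | set(doc_b))
    # A term missing from a document gets a weight of 0.
    u = [doc_a.get(term, 0.0) for term in vocab]
    v = [doc_b.get(term, 0.0) for term in vocab]
    return vocab, u, v

def cosine_similarity(u, v):
    """u.v / (|u||v|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical tf_idf weights for two documents.
doc_a = {"apple": 0.5, "banana": 0.2}
doc_b = {"banana": 0.4, "cherry": 0.9, "apple": 0.1}

vocab, u, v = align(doc_a, doc_b)
similarity = cosine_similarity(u, v)
```

Any fixed order works as long as both vectors use the same one; sorting is just a convenient way to guarantee that.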
Example:
Term | Doc1 | Doc2
Foo  | .3   | .7
Bar  | 0    | 8
Baz  | 1    | 1
Here you have two vectors, (.3,0,1) and (.7,8,1), and can compute the cosine similarity between them. If you compared (.3,1) and (.7,8) instead, you'd be comparing the Doc1 score of Baz against the Doc2 score of Bar, which wouldn't make sense.
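To make the difference concrete, here is a small sketch that runs both comparisons on the numbers from the table. It uses plain Python rather than the numpy-based function from the question, and the explicit length check is an added assumption about how you might want mismatches reported.

```python
import math

def cosine_similarity(u, v):
    """u.v / (|u||v|) for two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Term order for both vectors: Foo, Bar, Baz
doc1 = [0.3, 0.0, 1.0]
doc2 = [0.7, 8.0, 1.0]
aligned = cosine_similarity(doc1, doc2)  # about 0.14

# Dropping Doc1's zero and truncating Doc2 pairs Baz (Doc1) with Bar (Doc2).
misaligned = cosine_similarity([0.3, 1.0], [0.7, 8.0])  # about 0.98
```

The truncated vectors still produce a number, which is why slicing appears to work, but that number compares scores for different terms and does not describe how similar the documents are.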