Problem description
I am trying to use scikit-learn's NearestNeighbors implementation to find the column vectors closest to a given column vector, out of a matrix of random values.
This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21.
from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np

test = np.random.randint(0, 5, (50, 50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)

x = 21
for idx, d in enumerate(indices[x]):
    sim2 = smp.cosine_similarity(test[:, x], test[:, d])
    print("sklearns cosine similarity would be ", sim2)
    print('sklearns reported distance is', distances[x][idx])
    print('sklearns if that distance was cosine, the similarity would be: ', 1 - distances[x][idx])
The output looks something like:
sklearns cosine similarity would be [[ 0.66190748]]
sklearns reported distance is 0.616586738214
sklearns if that distance was cosine, the similarity would be: 0.383413261786
So the output of kneighbors is neither the cosine distance nor the cosine similarity. What gives?
Also, as an aside, I thought sklearn's NearestNeighbors implementation was exact rather than an approximate nearest neighbors approach, yet it doesn't seem to find the actual best neighbors in my dataset, compared with the results I get if I iterate over the matrix and check the similarity of column 211 to all the other columns. Am I misunderstanding something basic here?
Recommended answer
OK, the problem was that NearestNeighbors's .fit() method assumes the rows are samples and the columns are features. I had to transpose the matrix before passing it to fit.
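To make the rows-versus-columns point concrete, here is a minimal sketch. It assumes a deliberately non-square matrix (50 rows, 30 columns) and the default Euclidean metric, neither of which comes from the question; they are only there so the difference shows up in the output shapes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative data only: 50 rows and 30 columns, so rows and columns are easy to tell apart.
rng = np.random.default_rng(0)
test = rng.integers(0, 5, size=(50, 30)).astype(float)

# Fitting on `test` treats each of the 50 rows as a sample.
row_nbrs = NearestNeighbors(n_neighbors=5).fit(test)
_, row_indices = row_nbrs.kneighbors(test)

# Fitting on `test.T` treats each of the 30 columns as a sample.
col_nbrs = NearestNeighbors(n_neighbors=5).fit(test.T)
_, col_indices = col_nbrs.kneighbors(test.T)

print(row_indices.shape)  # one row of neighbor indices per row of `test`
print(col_indices.shape)  # one row of neighbor indices per column of `test`
```

With a square matrix like the 50x50 one in the question, both calls return arrays of the same shape, which is probably why the mix-up is easy to miss.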
Another problem is that the callable passed as metric should be a distance callable, not a similarity callable. kneighbors returns the points with the smallest metric values, so if you pass a similarity you'll get the K Farthest Neighbors :/
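Putting both fixes together might look like the sketch below. It swaps the similarity callable for scikit-learn's built-in "cosine" distance metric with algorithm="brute" (my choice, not something the answer specifies), fits on the transposed matrix, and then cross-checks the result against a plain brute-force similarity scan. The seeded random generator and the variable names are assumptions made for reproducibility.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
test = rng.integers(0, 5, size=(50, 50)).astype(float)

# Transpose so that each column of `test` becomes one sample (one row).
columns = test.T

# "cosine" here is the cosine distance, i.e. 1 - cosine similarity.
nbrs = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine").fit(columns)
distances, indices = nbrs.kneighbors(columns)

x = 21
sims = []
for dist, d in zip(distances[x], indices[x]):
    # cosine_similarity expects 2D inputs, hence the reshape.
    sim = cosine_similarity(columns[x].reshape(1, -1), columns[d].reshape(1, -1))[0, 0]
    sims.append(sim)
    print("neighbor", d, "distance", dist, "1 - distance", 1.0 - dist, "similarity", sim)

# Brute-force check: similarity of column x to every column, best five first.
all_sims = cosine_similarity(columns[x].reshape(1, -1), columns)[0]
best = np.argsort(-all_sims)[:5]
print("brute-force best neighbors:", best)
```

Here 1 - distance should agree with the directly computed similarity, and the first neighbor reported for column 21 is column 21 itself, since its distance to itself is zero.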