Problem Description
Is it possible to use something like 1 - cosine similarity with scikit-learn's KNeighborsClassifier?
This answer says no, but the documentation for KNeighborsClassifier says that the metrics listed in DistanceMetrics are available. Those distance metrics don't include an explicit cosine distance, probably because it isn't a true metric, but supposedly it's possible to pass a custom function as the metric. I tried passing scikit-learn's linear kernel to KNeighborsClassifier, but it gives me an error saying the function needs two arrays as arguments. Has anyone else tried this?
Recommended Answer
The cosine similarity is generally defined as x^T y / (||x|| * ||y||); it outputs 1 if the two vectors point in the same direction and goes to -1 if they point in completely opposite directions. This definition is not technically a metric, so you can't use accelerating structures like ball trees and k-d trees with it. If you force scikit-learn to use the brute-force approach, you should be able to use it as a distance by passing your own custom distance function as the metric. There are methods for transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
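As a minimal sketch of the brute-force route (the toy data and the helper name cosine_distance are illustrative, not from the original post): the callable passed as metric must take two 1-D vectors and return a single scalar, which is also why passing linear_kernel fails, since it expects 2-D matrices and returns a matrix.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): 2-D feature vectors with two classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# A callable metric receives two 1-D vectors and must return a scalar.
def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Force the brute-force algorithm, since cosine distance is not a true
# metric and cannot be used with ball trees or k-d trees.
clf = KNeighborsClassifier(n_neighbors=3, algorithm='brute',
                           metric=cosine_distance)
clf.fit(X, y)
print(clf.predict([[0.8, 0.2]]))

Note that a Python callable is evaluated pair by pair, so this is much slower than a built-in metric; recent scikit-learn versions also accept metric='cosine' directly when algorithm='brute'.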
Notice though that x^T y / (||x|| * ||y||) = (x/||x||)^T (y/||y||). The Euclidean distance can be equivalently written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every datapoint before giving it to the KNeighborsClassifier, then x^T x = 1 for all x, so the Euclidean distance reduces to sqrt(2 - 2 x^T y). For identical inputs we get sqrt(2 - 2*1) = 0, and for complete opposites we get sqrt(2 - 2*(-1)) = 2. Since sqrt(2 - 2 x^T y) is a monotonically decreasing function of the cosine similarity x^T y, normalizing your data and then using the Euclidean distance gives the same neighbor ordering as the cosine distance. As long as you use the uniform weights option, the results will be identical to having used a true cosine distance.
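A minimal sketch of this trick, with a numerical check of the identity above (the toy data is made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

# Toy data (hypothetical): 2-D feature vectors with two classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# L2-normalize each row so that x^T x = 1 for every sample. With
# normalized data, plain Euclidean distance gives the same neighbor
# ordering as cosine distance, so the default tree-based algorithms
# (ball tree / k-d tree) can be used.
X_norm = normalize(X)

clf = KNeighborsClassifier(n_neighbors=3, weights='uniform')
clf.fit(X_norm, y)
print(clf.predict(normalize([[0.8, 0.2]])))  # queries must be normalized too

# Numerical check: for unit vectors, ||x - y|| = sqrt(2 - 2 x^T y).
a, b = X_norm[0], X_norm[2]
print(np.linalg.norm(a - b), np.sqrt(2 - 2 * np.dot(a, b)))

The main design point is that normalization moves the cosine computation out of the distance function, so the query cost benefits from the tree-based acceleration that a custom metric would rule out.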