Problem description
I'm having trouble understanding a specific use case of the sklearn.cluster.SpectralClustering class as outlined in the official documentation. Say I want to use my own affinity matrix to perform clustering. I first instantiate an object of class SpectralClustering as follows:
from sklearn.cluster import SpectralClustering
cl = SpectralClustering(n_clusters=5, affinity='precomputed')
The documentation for the affinity parameter above is as follows:
If a string, this may be one of 'nearest_neighbors', 'precomputed', 'rbf' or one of the kernels supported by sklearn.metrics.pairwise_kernels. Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.
Now the object cl has a method fit, for which the documentation about its sole parameter X is as follows:
OR, if affinity=='precomputed', a precomputed affinity matrix of shape (n_samples, n_samples)
This is where it gets confusing. I am using my own affinity matrix, where a value of 0 means two points are identical and a higher number means two points are more dissimilar. However, the other choices for the parameter affinity actually take a data set and produce a similarity matrix, in which higher values indicate more similarity and lower values indicate dissimilarity (such as the radial basis kernel).
So when using the fit method on my instance of SpectralClustering, do I actually need to transform my affinity matrix into a similarity matrix before passing it to the fit method call as the parameter X? The same documentation page makes a note on transforming distances to well-behaved similarities, but does not explicitly indicate where this step should be carried out, or via which method call.
Recommended answer
Straight from the docs:
np.exp(- X ** 2 / (2. * delta ** 2))
This goes in your own code, and the result can be passed to fit. Here X is your distance matrix and delta is a free parameter that sets the width of the Gaussian kernel. For the purpose of this algorithm, affinity means similarity, not distance.
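As a minimal sketch of that workflow, the snippet below builds a distance matrix, converts it to similarities with the formula above, and passes the result to the estimator. The toy data, the variable names, and the choice of delta (the median nonzero distance) are assumptions made for illustration only; they are not part of the original question or the scikit-learn docs, and delta usually needs tuning for real data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: three blobs of 20 points each in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Distance matrix: 0 means identical, larger means more dissimilar.
distances = pairwise_distances(points, metric="euclidean")

# delta is a free width parameter. Using the median nonzero distance
# is an assumption for this sketch, not a scikit-learn default.
delta = np.median(distances[distances > 0])

# Gaussian kernel from the docs: turns distances into similarities in (0, 1].
similarity = np.exp(-distances ** 2 / (2.0 * delta ** 2))

# With affinity='precomputed', X must be the similarity matrix,
# of shape (n_samples, n_samples).
cl = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0)
labels = cl.fit_predict(similarity)
```

After the transform, identical points get a similarity of 1 and distant points approach 0, which is the orientation the algorithm expects.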