Problem description
I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), since I'd like to cluster sequences of integers by sequence similarity rather than by something like the Euclidean distance, which isn't meaningful here.
My data looks like this:
>> dat.values
array([[860, 261, 240, ..., 300, 241, 1],
[860, 840, 860, ..., 860, 240, 1],
[260, 860, 260, ..., 260, 220, 1],
...,
[260, 260, 260, ..., 260, 260, 1],
[260, 860, 260, ..., 840, 860, 1],
[280, 240, 241, ..., 240, 260, 1]])
I've created the following similarity function:
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)
So I just return the fraction of matching values in the two sequences using numpy, and make the following call:
cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)
But I get this error:
TypeError: sim() missing 1 required positional argument: 'y'
I'm not sure why I'm getting this error; I thought the function would be called on pairs of rows, so each required argument would be passed.
Any help with this would be greatly appreciated.
Recommended answer
'affinity' as a callable takes a single input X (your feature or observation matrix) and must return the distances between all pairs of points (samples) in it.
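To make the cause of the TypeError concrete, here is a minimal sketch that calls sim the way a single-argument affinity callable is invoked, with the whole matrix and nothing else. The toy array is made up for illustration, and the direct call only approximates what the library does internally.

```python
import numpy as np


def sim(x, y):
    # Pairwise similarity from the question: fraction of matching positions
    return np.sum(np.equal(np.array(x), np.array(y))) / len(x)


# Toy data, invented for this illustration
X = np.array([[860, 261, 240], [860, 840, 860]])

# A callable affinity receives the full matrix as its only argument,
# so the second parameter `y` is never supplied.
try:
    sim(X)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The fix is therefore not to change how rows are compared, but to wrap the pairwise function in one that accepts the whole matrix.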
So you need to modify your method like this:
from sklearn.metrics import pairwise_distances

# Your method to calculate distance between two samples
def sim(x, y):
    return np.sum(np.equal(np.array(x), np.array(y)))/len(x)

# Method to calculate distances between all sample pairs
def sim_affinity(X):
    return pairwise_distances(X, metric=sim)

cluster = AgglomerativeClustering(n_clusters=5, affinity=sim_affinity, linkage='average')
cluster.fit(X)
Or you can use affinity='precomputed', as @avchauzov has suggested. For that, you have to pass the precomputed distance matrix for your observations to fit(). Something like:
cluster = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='average')
distance_matrix = sim_affinity(X)
cluster.fit(distance_matrix)
Note: You have specified a similarity in place of a distance. Agglomerative clustering merges the pairs with the smallest values first, so feeding it a similarity would group the least similar sequences together. Make sure you understand how the clustering will work here, or tweak your similarity function to return a distance instead. Something like:
def sim(x, y):
    # Subtracted from 1.0 (highest similarity), so now it represents distance
    return 1.0 - np.sum(np.equal(np.array(x), np.array(y)))/len(x)