Problem description

I have two distance matrices, each 232*232 where the column and row labels are identical. So this would be an abridged version of the two where A, B, C and D are the names of the points between which the distances are measured:

  A  B  C  D ...    A  B  C  D  ...
A 0  1  5  3      A 0  5  3  9
B 4  0  4  1      B 2  0  7  8
C 2  6  0  3      C 2  6  0  1
D 2  7  1  0      D 5  2  5  0
...               ...

The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted to do this by first adjusting the distances in each matrix by dividing every distance by the largest distance in the matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resultant matrix. The algorithm I was advised to use for this was the k means algorithm. The hope was that I could identify clusters of positive numbers that would correspond to pairs that were very close in matrix one and far apart in matrix two and vice versa for clusters of negative numbers.
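The preprocessing described above can be sketched roughly as follows. The function name `normalized_difference` and the direction of the subtraction (matrix two minus matrix one, so that positive values mean "close in matrix one, far apart in matrix two") are assumptions made for illustration; the question does not say which matrix was subtracted from which. The 4 x 4 arrays are the abridged matrices shown earlier.

```python
import numpy as np

def normalized_difference(dist_one, dist_two):
    """Scale each distance matrix by its own maximum, then subtract.

    Assumed sign convention: result = scaled_two - scaled_one, so a
    positive entry marks a pair that is relatively close in matrix one
    and relatively far apart in matrix two.
    """
    dist_one = np.asarray(dist_one, dtype=float)
    dist_two = np.asarray(dist_two, dtype=float)
    scaled_one = dist_one / dist_one.max()
    scaled_two = dist_two / dist_two.max()
    return scaled_two - scaled_one

# The abridged matrices from the question (rows and columns are A, B, C, D).
matrix_one = [[0, 1, 5, 3],
              [4, 0, 4, 1],
              [2, 6, 0, 3],
              [2, 7, 1, 0]]
matrix_two = [[0, 5, 3, 9],
              [2, 0, 7, 8],
              [2, 6, 0, 1],
              [5, 2, 5, 0]]

diff = normalized_difference(matrix_one, matrix_two)
print(diff.round(2))
```

Every entry of the result lies between -1 and 1, and the diagonal stays at 0 because each point is at distance 0 from itself in both networks.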

Firstly, I've read quite a bit about how to implement k-means in Python, and I'm aware that there are multiple different modules that can be used. I've tried all three of these:

1.

import numpy as np
import sklearn.cluster

# Load the difference matrix from file
data = np.load('difference_matrix_file.npy')

# Each row of the matrix is treated as one sample
a = np.asarray(data)
clust_centers = 3

# k_means returns a tuple: (centroids, labels, inertia)
model = sklearn.cluster.k_means(a, clust_centers)
print(model)

2.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Load the difference matrix from file
difference_matrix = np.load('difference_matrix_file.npy')

data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
# fit() returns the fitted estimator, so this prints the model itself,
# not the clusters (those are in model.labels_ and model.cluster_centers_)
print(model.fit(data))

3.

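import sys

import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Print arrays in full; threshold=np.nan is rejected by current NumPy
np.set_printoptions(threshold=sys.maxsize)

# Load the difference matrix from file
difference_matrix = np.load('difference_matrix_file.npy')

# Rescale each column to unit variance before clustering
whitened = whiten(difference_matrix)
# kmeans returns a tuple: (codebook of centroids, distortion)
centroids = kmeans(whitened, 3)
print(centroids)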

What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist if the reader hadn't already guessed). I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster so three in this case, that I could then trace back to my two original matrices and identify the names of the pairs of interest.

However, what I get is an array containing lists of numbers (one for each cluster), and I don't really understand what these numbers are. They don't obviously correspond to what I had in my input matrix, other than the fact that there are 232 items in each list, which is the same as the number of rows and columns in the input matrix. The last item in the array is another single number, which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.

I've been trying to figure this out for quite a while now but I'm struggling to get anywhere. Whenever I search for interpreting the output of kmeans I just get explanations of how to plot my clusters on a graph which isn't what I want to do. Please can someone explain to me what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?

Recommended answer

You have two issues here, and the recommendation of k-means probably was not very good...

  1. K-means needs a coordinate data matrix, not a distance matrix.

In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
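To make this concrete, here is a minimal sketch of what scikit-learn's `KMeans` does when it is handed a square matrix: it treats every row as one point whose coordinates are the values in that row. The random 20 x 20 array and the `names` list are hypothetical stand-ins for the 232 x 232 difference matrix and its labels.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
names = [f"P{i}" for i in range(20)]   # hypothetical point labels
data = rng.random((20, 20))            # stand-in for the 232 x 232 matrix

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# One centroid per cluster; each centroid has one value per matrix column,
# because every row was treated as a point in a 20-dimensional space.
print(model.cluster_centers_.shape)    # (3, 20)

# One cluster index per row, telling which cluster that row was assigned to.
print(model.labels_)

# A single number for the whole fit: the summed squared distance of every
# row to its nearest centroid. It is not a centroid.
print(model.inertia_)

# Map cluster indices back to the row names.
members = {k: [names[i] for i in np.flatnonzero(model.labels_ == k)]
           for k in range(3)}
print(members)
```

This matches the output described in the question: the lists with as many entries as the matrix has columns are the centroids, and the lone trailing number is the overall error (`inertia` in scikit-learn, `distortion` in `scipy.cluster.vq.kmeans`). Note that the clusters are groups of rows, not groups of pairs, which is why the result cannot be traced back to individual cells of the matrix.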

If you compute the difference of two distance matrices, small values correspond to pairs of points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0, and will thus be considered identical now.
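The A and B case is easy to reproduce with a small, made-up example. The two 3 x 3 matrices below are hypothetical; they are built so that A and B are the farthest-apart pair in both networks.

```python
import numpy as np

names = ["A", "B", "C"]

# Hypothetical distances; A-B is the largest distance in each network.
net_one = np.array([[0, 10, 2],
                    [10, 0, 3],
                    [2, 3, 0]], dtype=float)
net_two = np.array([[0, 8, 4],
                    [8, 0, 1],
                    [4, 1, 0]], dtype=float)

# Same procedure as in the question: scale by the maximum, then subtract.
diff = net_one / net_one.max() - net_two / net_two.max()

print(diff[0, 1])   # 0.0
```

A and B are maximally far apart in both networks, yet their entry in the difference matrix is exactly 0, the same value every point has with itself on the diagonal.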

So you haven't understood the input of k-means; no wonder you do not understand the output.

I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation that accepts a similarity; the usual implementations for a distance matrix will not work.
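One possible workaround, offered only as a sketch and not as what the answer prescribes, is to turn the similarity into a dissimilarity and then use SciPy's ordinary hierarchical clustering. The helper name `cluster_from_similarity`, the symmetrization by averaging, the `max - similarity` conversion, and the choice of average linkage are all assumptions made here for illustration; the random matrix stands in for the real difference matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_from_similarity(similarity, n_clusters):
    """Hierarchical clustering on a similarity matrix (hypothetical helper)."""
    sim = np.asarray(similarity, dtype=float)
    # Assumption: average the two directions so the matrix is symmetric.
    sim = (sim + sim.T) / 2.0
    # Assumption: the most similar pair gets dissimilarity 0.
    dissim = sim.max() - sim
    np.fill_diagonal(dissim, 0.0)
    # linkage expects the condensed (upper-triangle) form.
    condensed = squareform(dissim, checks=False)
    tree = linkage(condensed, method="average")
    # Cut the tree into at most n_clusters groups; labels start at 1.
    return fcluster(tree, t=n_clusters, criterion="maxclust")

rng = np.random.default_rng(0)
difference = rng.uniform(-1, 1, size=(6, 6))   # stand-in difference matrix

# "Try absolute values": large |difference| is treated as high similarity.
labels = cluster_from_similarity(np.abs(difference), n_clusters=2)
print(labels)
```

For the "positives only" or "negatives only" variants, `np.clip(difference, 0, None)` or `np.clip(-difference, 0, None)` could be passed in place of `np.abs(difference)`. Whether any of these similarities is meaningful for the two networks is a modelling decision the code cannot make.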
