Problem description
I've read the docs here and looked at this tutorial, but I am still missing something fundamental about using K-means in scikit-learn:
Say I have a dataset like this:
| UserName | Variable1 | Variable2 | Variable3 | Cluster |
| bob      | 1         | 3         | 7         |         |
| joe      | 2         | 4         | 8         |         |
| bill     | 1         | 6         | 4         |         |
Since K-means takes a NumPy array, I have to strip out the user name and use only the numerical variables. But after the clusters have been created, how do I relate them back to each individual user for further analysis? That is, how would I fill the "Cluster" column with the corresponding cluster number?
Recommended answer
Here's an example, assuming you have read the data from a file into a list:
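import numpy as np
import sklearn.cluster

# Each row: user name followed by the numeric variables
data = [
    ['bob', 1, 3, 7],
    ['joe', 2, 4, 8],
    ['bill', 1, 6, 4],
]

# Split the names from the numeric columns, keeping the row order
labels = [x[0] for x in data]
a = np.array([x[1:] for x in data])

clust_centers = 2
model = sklearn.cluster.k_means(a, clust_centers)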
model now contains a tuple of (centroids, labels, inertia). The labels array holds one cluster id per input row, in the same order as the rows of a.
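As a minimal sketch of what that tuple holds, the return value can be unpacked into named variables. The names `centroids`, `cluster_ids`, and `inertia`, along with the `n_init=10` and `random_state=0` settings, are choices made here for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.cluster import k_means

a = np.array([[1, 3, 7], [2, 4, 8], [1, 6, 4]])

# By default k_means returns (centroids, labels, inertia).
# random_state is fixed only so that repeated runs give the same result.
centroids, cluster_ids, inertia = k_means(a, n_clusters=2, n_init=10, random_state=0)

print(centroids.shape)    # one centroid per cluster, one column per variable
print(cluster_ids.shape)  # one label per input row, in the same order as `a`
print(inertia)            # sum of squared distances to the nearest centroid
```

Because the label array follows the input row order, any list that was built in that same order (such as the user names) can be paired with it position by position.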
So map each user name to its cluster id like this:
clusters = dict(zip(labels, model[1]))
And to print the cluster id for 'bob':
print(clusters['bob'])
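For further analysis it can also help to go the other way and list which users fall into each cluster. The following is one possible sketch, assuming the same three-row dataset; the `members` dictionary and the fixed `random_state` are illustrative additions rather than part of the original answer.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import k_means

data = [['bob', 1, 3, 7], ['joe', 2, 4, 8], ['bill', 1, 6, 4]]
names = [row[0] for row in data]
a = np.array([row[1:] for row in data])

_, cluster_ids, _ = k_means(a, n_clusters=2, n_init=10, random_state=0)

# Group user names by cluster id; int() turns the NumPy integer into a plain key
members = defaultdict(list)
for name, cid in zip(names, cluster_ids):
    members[int(cid)].append(name)

for cid in sorted(members):
    print(cid, members[cid])
```

Note that the numeric cluster ids are arbitrary: which group is called 0 and which is called 1 can change between runs or settings, so only the grouping itself is meaningful.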
Or write it back out as CSV, with the cluster id appended to each row, like this:
for d in data:
    print('%s,%d' % (','.join([str(x) for x in d]), clusters[d[0]]))
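If pandas is available (an assumption; the original answer uses only lists and NumPy), the "Cluster" column from the question can be filled in directly, since `KMeans.fit_predict` returns one label per row in the same order as the rows it was given. This is a sketch of that alternative, with the column names taken from the question's table.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame(
    [["bob", 1, 3, 7], ["joe", 2, 4, 8], ["bill", 1, 6, 4]],
    columns=["UserName", "Variable1", "Variable2", "Variable3"],
)

# Cluster on the numeric columns only; UserName stays in the frame untouched
features = df[["Variable1", "Variable2", "Variable3"]]
km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["Cluster"] = km.fit_predict(features)

print(df)

# to_csv with no path returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

Keeping the names and the labels in one DataFrame avoids the separate dictionary lookup, at the cost of an extra dependency.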