Problem description
I'm trying to understand how to manipulate a hierarchical clustering, but the documentation is too... technical... and I can't understand how it works.
Is there any tutorial that can help me get started, explaining some simple tasks step by step?
Let's say I have the following data set:
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

a = np.array([[0,   0  ],
              [1,   0  ],
              [0,   1  ],
              [1,   1  ],
              [0.5, 0  ],
              [0,   0.5],
              [0.5, 0.5],
              [2,   2  ],
              [2,   3  ],
              [3,   2  ],
              [3,   3  ]])
I can easily do the hierarchical clustering and plot the dendrogram:
z = linkage(a)
d = dendrogram(z)
- Now, how can I recover a specific cluster? Let's say the one with elements [0, 1, 2, 4, 5, 6] in the dendrogram?
- How can I get back the values of those elements?
Recommended answer

There are three steps in hierarchical agglomerative clustering (HAC):

- Quantify Data (metric argument)
- Cluster Data (method argument)
- Choose the number of clusters
Doing

z = linkage(a)

will accomplish the first two steps. Since you did not specify any parameters, it uses the standard values:

- metric = 'euclidean'
- method = 'single'
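For clarity, here is a minimal sketch of the same call with those defaults written out explicitly (linkage also has further optional parameters, left untouched here):

# Equivalent to z = linkage(a), with the default parameters spelled out
z = linkage(a, method='single', metric='euclidean')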
So

z = linkage(a)

will give you a single-linkage hierarchical agglomerative clustering of a. This clustering is kind of a hierarchy of solutions. From this hierarchy you get some information about the structure of your data. What you might do now is:
- Check which metric is appropriate, e.g. cityblock or chebychev will quantify your data differently (cityblock, euclidean and chebychev correspond to the L1, L2, and L_inf norms)
- Check the different properties/behaviours of the methods (e.g. single, complete and average)
- Check how to determine the number of clusters, e.g. by reading the wiki about it
- Compute indices on the found solutions (clusterings), such as the silhouette coefficient (with this coefficient you get feedback on how well a point/observation fits the cluster it is assigned to). Different indices use different criteria to qualify a clustering; see the sketch after this list.
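As an illustration of that last point, here is a minimal sketch of scoring different cluster counts with the silhouette coefficient. It assumes scikit-learn is installed (silhouette_score is not part of SciPy) and reuses the data set a from the question:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

z = linkage(a, method='single')
for k in range(2, 6):
    # Cut the hierarchy into at most k flat clusters;
    # silhouette_score needs at least two distinct labels
    labels = fcluster(z, k, criterion='maxclust')
    print(k, silhouette_score(a, labels))

Values closer to 1 indicate points that sit well inside their assigned cluster, values near 0 or below indicate overlapping or misassigned points.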
Start out with this:
import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt

a = np.array([[0.1, 2.5],
              [1.5, .4 ],
              [0.3, 1  ],
              [1  , .8 ],
              [0.5, 0  ],
              [0  , 0.5],
              [0.5, 0.5],
              [2.7, 2  ],
              [2.2, 3.1],
              [3  , 2  ],
              [3.2, 1.3]])

fig, axes23 = plt.subplots(2, 3)

for method, axes in zip(['single', 'complete'], axes23):
    z = hac.linkage(a, method=method)

    # Plotting: merge distances in descending order (scree plot) and their
    # second differences, whose largest values suggest knee points
    axes[0].plot(range(1, len(z)+1), z[::-1, 2])
    knee = np.diff(z[::-1, 2], 2)
    axes[0].plot(range(2, len(z)), knee)

    num_clust1 = knee.argmax() + 2
    knee[knee.argmax()] = 0
    num_clust2 = knee.argmax() + 2

    axes[0].text(num_clust1, z[::-1, 2][num_clust1-1], 'possible\n<- knee point')

    # Cut the hierarchy at the two candidate numbers of flat clusters
    part1 = hac.fcluster(z, num_clust1, 'maxclust')
    part2 = hac.fcluster(z, num_clust2, 'maxclust')

    clr = ['#2200CC', '#D9007E', '#FF6600', '#FFCC00', '#ACE600', '#0099CC',
           '#8900CC', '#FF0000', '#FF9900', '#FFFF00', '#00CC01', '#0055CC']

    # Scatter each partition, one colour per cluster
    for part, ax in zip([part1, part2], axes[1:]):
        for cluster in set(part):
            ax.scatter(a[part == cluster, 0], a[part == cluster, 1],
                       color=clr[cluster])

    m = '\n(method: {})'.format(method)
    plt.setp(axes[0], title='Screeplot{}'.format(m), xlabel='partition',
             ylabel='{}\ncluster distance'.format(m))
    plt.setp(axes[1], title='{} Clusters'.format(num_clust1))
    plt.setp(axes[2], title='{} Clusters'.format(num_clust2))

plt.tight_layout()
plt.show()
Giving, for each method, the scree plot and the scatter plots of the two suggested partitions (figure not reproduced here).
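Coming back to the original question of recovering the cluster with elements [0, 1, 2, 4, 5, 6] and their values: one way (a sketch, not the only option) is to cut the hierarchy with fcluster at a distance threshold read off the dendrogram. For the question's data set with single linkage, any threshold between 0.5 and roughly 0.7 isolates exactly that cluster:

from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

z = linkage(a)  # single linkage, euclidean metric (the defaults)

# Merge all points whose cophenetic distance is <= 0.6; for this data
# that groups exactly [0, 1, 2, 4, 5, 6] and leaves the rest separate
labels = fcluster(z, 0.6, criterion='distance')

members = np.where(labels == labels[0])[0]  # indices of point 0's cluster
print(members)      # [0 1 2 4 5 6]
print(a[members])   # the values of those elements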