How do I get the subtrees of a dendrogram made by scipy.cluster.hierarchy?

Question

    I was confused by this module (scipy.cluster.hierarchy), and to some extent I still am.

    For example we have the following dendrogram:

    My question is: how can I extract the coloured subtrees (each one represents a cluster) in a nice format, say SIF format? The code that produces the plot above is:

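    import numpy as np
    import scipy.cluster.hierarchy as sch
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist
    
    # 100 random 2-D observations
    X = np.random.randn(100, 2)
    
    # condensed pairwise distance matrix
    d = pdist(X)
    
    # complete-linkage hierarchical clustering
    Z = sch.linkage(d, method='complete')
    
    P = sch.dendrogram(Z)
    
    plt.savefig('plot_dendrogram.png')
    
    # flat clusters, cutting the tree at half the maximum pairwise distance
    T = sch.fcluster(Z, 0.5*d.max(), 'distance')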
    #array([4, 5, 3, 2, 2, 3, 5, 2, 2, 5, 2, 2, 2, 3, 2, 3, 2, 5, 4, 5, 2, 5, 2,
    #       3, 3, 3, 1, 3, 4, 2, 2, 4, 2, 4, 3, 3, 2, 5, 5, 5, 3, 2, 2, 2, 5, 4,
    #       2, 4, 2, 2, 5, 5, 1, 2, 3, 2, 2, 5, 4, 2, 5, 4, 3, 5, 4, 4, 2, 2, 2,
    #       4, 2, 5, 2, 2, 3, 3, 2, 4, 5, 3, 4, 4, 2, 1, 5, 4, 2, 2, 5, 5, 2, 2,
    #       5, 5, 5, 4, 3, 3, 2, 4], dtype=int32)
    
    sch.leaders(Z,T)
    # (array([190, 191, 182, 193, 194], dtype=int32),
    #  array([2, 3, 1, 4,5],dtype=int32))
    

    So now, the output of fcluster() gives the clustering of the nodes (by their ids), and leaders(), as described in its documentation, is supposed to return two arrays:

    • the first contains the leader nodes of the clusters generated by Z; here we can see that we have 5 clusters, the same as in the plot

    • the second contains the ids of these clusters

    So if leaders() returns L and M respectively, then L[2]=182 and M[2]=1, meaning cluster 1 is led by node id 182, which doesn't exist in the observation set X. The documentation says "... then it corresponds to a non-singleton cluster", but I don't understand what that means.

    Also, I converted Z to a tree with sch.to_tree(Z), which returns an easy-to-use tree object that I want to visualize. Which tool should I use as a graphical platform that accepts this kind of tree object as input?

    Solution

    Answering the part of your question regarding tree manipulation...

    As explained in another answer, you can read the coordinates of the branches from the icoord and dcoord entries of the dictionary returned by dendrogram() (P in your code). For each branch, the coordinates are given from left to right.
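
To make the layout of icoord and dcoord concrete, here is a minimal sketch. The random data, the seed and the variable names are assumptions made for illustration and are not part of the original question; no_plot=True asks dendrogram() to compute the coordinates without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: 20 random 2-D observations
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
Z = linkage(pdist(X), method='complete')

# no_plot=True computes the layout without drawing anything
P = dendrogram(Z, no_plot=True)
icoord = np.array(P['icoord'])  # x coordinates, one row of four per branch
dcoord = np.array(P['dcoord'])  # heights, one row of four per branch

# n observations give n - 1 branches, one per merge
print(icoord.shape, dcoord.shape)

# Each branch is an inverted U read from left to right:
# (x_left, y_left) -> (x_left, y_top) -> (x_right, y_top) -> (x_right, y_right)
x_left, _, _, x_right = icoord[0]
y_left, y_top, _, y_right = dcoord[0]
print(f"left leg at x={x_left}, right leg at x={x_right}, merge height {y_top:.3f}")
```

The two middle entries of each dcoord row are equal and give the distance at which the two children were merged, which is the value to filter on when selecting branches by height.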

    If you want to manually plot the tree you can use something like:

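    import numpy as np
    import matplotlib.pyplot as plt

    def plot_tree(P, pos=None):
        """Plot the branches stored in P, or only those at the indices in pos."""
        plt.clf()
        icoord = np.array(P['icoord'])
        dcoord = np.array(P['dcoord'])
        color_list = np.array(P['color_list'])
        # axis limits come from the full tree, so a subset keeps the same frame
        xmin, xmax = icoord.min(), icoord.max()
        ymin, ymax = dcoord.min(), dcoord.max()
        if pos is not None:
            icoord = icoord[pos]
            dcoord = dcoord[pos]
            color_list = color_list[pos]
        # each branch is drawn as a four-point polyline in its own color
        for xs, ys, color in zip(icoord, dcoord, color_list):
            plt.plot(xs, ys, color=color)
        plt.xlim(xmin-10, xmax + 0.1*abs(xmax))
        plt.ylim(ymin, ymax + 0.1*abs(ymax))
        plt.show()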
    

    Where, in your code, plot_tree(P) gives:

    The function allows you to select just some branches:

    plot_tree(P, range(10))
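
Since the question is about the coloured subtrees, one way to obtain branch indices is to group them by the colour that dendrogram() assigned. The following is only a sketch under stated assumptions: the data and seed are made up, color_threshold is set to the same cutoff that the question passes to fcluster(), and 'grey' is an arbitrary choice for the links above that cutoff.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: 30 random 2-D observations
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
d = pdist(X)
Z = linkage(d, method='complete')

# Links merged at or above the cutoff get above_threshold_color;
# links below it are coloured per subtree.
cutoff = 0.5 * d.max()
P = dendrogram(Z, color_threshold=cutoff,
               above_threshold_color='grey', no_plot=True)

# Map each colour to the indices of the branches drawn in that colour
branches_by_color = defaultdict(list)
for i, color in enumerate(P['color_list']):
    branches_by_color[color].append(i)

for color, idx in branches_by_color.items():
    print(color, idx)
```

Each list in branches_by_color can be passed as pos to plot_tree(). Note that dendrogram() cycles through a small palette, so two different subtrees can end up with the same colour when there are many clusters.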
    

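    Now you have to know which branches to plot. The fcluster() output may be a little obscure, so another way to choose branches is by a minimum and a maximum distance tolerance. The merge distances are the third column of the linkage() output (Z in the OP's case), and the same values are the top of each branch in dcoord. Since dendrogram() does not return its branches in the row order of Z, the selection below reads the heights from dcoord so that the indices match what plot_tree() expects:

    import numpy as np
    dmin = 0.2
    dmax = 0.3
    # height of the horizontal bar of each branch, i.e. its merge distance
    heights = np.array(P['dcoord'])[:, 1]
    pos = ((heights >= dmin) & (heights <= dmax)).nonzero()
    plot_tree(P, pos)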
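
On the part of the question about leaders(): in a linkage matrix over n observations, ids below n are the original observations and id n + i is the cluster formed by row i of Z, so with 100 observations a leader id of 182 refers to the cluster created in row 82. The sketch below, which is not part of the original answer, uses to_tree(Z, rd=True) to look up each leader node, lists the observations under it, and writes the subtree as SIF-style lines. The data, the helper subtree_edges and the interaction label parent_of are hypothetical names chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaders, linkage, to_tree
from scipy.spatial.distance import pdist

# Hypothetical data: 30 random 2-D observations
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
d = pdist(X)
Z = linkage(d, method='complete')
T = fcluster(Z, 0.5 * d.max(), 'distance')

L, M = leaders(Z, T)
# rd=True also returns a list in which nodes[i] is the ClusterNode with id i
root, nodes = to_tree(Z, rd=True)


def subtree_edges(node):
    """Return (parent_id, child_id) pairs for every link at or below node."""
    edges = []
    stack = [node]
    while stack:
        cur = stack.pop()
        if not cur.is_leaf():
            for child in (cur.get_left(), cur.get_right()):
                edges.append((cur.get_id(), child.get_id()))
                stack.append(child)
    return edges


members_by_cluster = {}
sif_by_cluster = {}
for leader_id, cluster_id in zip(L, M):
    node = nodes[leader_id]
    # pre_order() returns the ids of the original observations under the node
    members_by_cluster[int(cluster_id)] = sorted(node.pre_order())
    # One "source interaction target" line per parent-child link
    sif_by_cluster[int(cluster_id)] = [
        f"{parent} parent_of {child}" for parent, child in subtree_edges(node)
    ]

for cluster_id, members in sorted(members_by_cluster.items()):
    print(cluster_id, members)
```

Writing each list in sif_by_cluster to its own text file gives one file per cluster, which a graph tool that reads SIF should be able to load. A cluster with a single observation has a leaf as its leader and therefore produces no edges.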
    

