Problem Description

I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less than a threshold "T".

I do not know how many clusters exist. At the end, there may be individual vectors left over that are not part of any cluster, because their Euclidean distance to any of the other vectors in the space is not less than "T".

What existing algorithms / approach should be used here?

Thanks, Abhishek S.

Recommended Answer

You can use hierarchical clustering. It is a rather basic approach, so there are lots of implementations available. It is, for example, included in Python's SciPy.

For example, see the script below:

import matplotlib.pyplot as plt
import numpy
import scipy.cluster.hierarchy as hcluster

# generate three clusters of around 100 points each, plus one outlier
N = 100
data = numpy.random.randn(3 * N, 2)
data[:N] += 5      # shift the first cluster
data[-N:] += 10    # shift the last cluster
data[-1:] -= 20    # move the last point far away from everything else

# clustering: form flat clusters by cutting the dendrogram at thresh
thresh = 1.5
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

# plotting
plt.scatter(*numpy.transpose(data), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()

Which produces a result similar to the following image.

The threshold given as a parameter is the distance value on whose basis the decision is made whether points/clusters will be merged into another cluster. The distance metric to be used can also be specified.

Note that there are various methods for computing the intra-/inter-cluster similarity, e.g. the distance between the closest points, the distance between the farthest points, the distance to the cluster centers, and so on. Some of these methods are also supported by SciPy's hierarchical clustering module (single/complete/average/... linkage). According to your post, I think you would want to use complete linkage.
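A minimal sketch of how complete linkage maps to the threshold requirement, using `linkage` and `fcluster` directly (the sample points and the threshold value here are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# two tight groups of 2-D points, far apart from each other
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# complete linkage defines the cluster distance as the distance between
# the two farthest members, so cutting the dendrogram at T guarantees
# that every pair inside a flat cluster is within T of each other
T = 1.0
Z = linkage(pdist(points), method="complete")
labels = fcluster(Z, t=T, criterion="distance")
print(labels)  # two flat clusters, one per group
```

The one-call form `fclusterdata(data, T, criterion="distance", method="complete")` wraps the same two steps.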

Note that this approach also allows small (single point) clusters if they don't meet the similarity criterion of the other clusters, i.e. the distance threshold.

There are other algorithms that will perform better, which becomes relevant in situations with lots of data points. As other answers/comments suggest, you might also want to have a look at the DBSCAN algorithm:

  • https://en.wikipedia.org/wiki/DBSCAN
  • http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
  • http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
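For comparison, a minimal DBSCAN sketch with scikit-learn. Note that `eps` bounds the distance to a neighbor, not the cluster diameter, so it is not a drop-in replacement for the threshold "T"; the sample data and the `eps`/`min_samples` values here are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two small groups plus one isolated point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0],
                   [20.0, 20.0]])

# min_samples=2: a point is a core point if it has at least one
# neighbor within eps (the count includes the point itself)
db = DBSCAN(eps=0.5, min_samples=2).fit(points)
print(db.labels_)  # the isolated point is labeled -1 (noise)
```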

For a nice overview on these and other clustering algorithms, also have a look at this demo page (of Python's scikit-learn library):

(The clustering-comparison image shown here is copied from that page.)

As you can see, each algorithm makes some assumptions about the number and shape of the clusters, which need to be taken into account, be they implicit assumptions imposed by the algorithm or explicit assumptions specified via its parameters.
