问题描述
可能重复:
K-means算法的变化与平等的簇大小
编辑:像casperOne指出来我这个问题是重复的。反正在这里的是,包括这一个更广义的问题:http://stats.stackexchange.com/questions/8744/clustering-procedure-where-each-cluster-has-an-equal-number-of-points
like casperOne point it out to me this question is a duplicate. Anyways here is a more generalized question that cover this one: http://stats.stackexchange.com/questions/8744/clustering-procedure-where-each-cluster-has-an-equal-number-of-points
我的要求
在一个项目中,我需要组n个点(X,Y)的k个簇大小相等(N / K)的。其中x和y是双浮动数字,n可以为100〜10000且k的范围从2至100。而且k被该算法运行之前已知
In a project I need to group n points (x,y) in k clusters of equal size (n / k). Where x and y are double floating numbers, n can range from 100 to 10000 and k can range from 2 to 100. Also k is known before the algorithm runs.
我experimentations
我开始解决使用http://en.wikipedia.org/wiki/K-means_clustering算法,它工作的伟大,快速地产生大致相同大小的k个簇。
I started to resolve the problem by using the http://en.wikipedia.org/wiki/K-means_clustering algorithm, which work great and fast to produce exactly k clusters of roughly the same size.
但我的问题是这样的,K-均值产生集群大致相同的大小,我需要集群是完全相同的大小(或更precise的:我需要他们有楼之间的尺寸(N / K)和CEIL(N / K))。
But my problem is this, K-means produce clusters of roughly the same size, where I need the clusters to be exactly the same size (or to be more precise: I need them to have a size between floor(n / k) and ceil(n / k)).
在你点出来给我,是我尝试的第一个答案这里K-means算法的变化与平等的簇大小,这听起来是个好主意。
Before you point it out to me, yes I tried the first answer here K-means algorithm variation with equal cluster size, which sounds like a good idea.
其主要思想是后期处理的集群产品由K-均值阵列。从最大群集到最小。我们减少具有超过N / K成员通过移动加分的其他最近的聚类簇的大小。只留下那些已经降低了集群。
The main idea is to post process the array of cluster produce by K-means. From the biggest cluster up to the smallest. We reduce the size of the clusters that have more than n / k members by moving extra points to an other nearest cluster. Leaving alone the clusters that are already reduced.
下面是伪code我实现了:
Here is the pseudo code I implemented:
n is the number of point
k is the number of cluster
m = n / k (the ideal cluster size)
c is the array of cluster after K-means
c' = c sorted by size in descending order
for each cluster i in c' where i = 1 to k - 1
n = size of cluster i - m (the number of point to move)
loop n times
find a point p in cluster i with minimal distance to a cluster j in c' where j > i
move point p from cluster i to cluster j
end loop
recalculate centroids
end for each
这个算法的问题是,接近过程结束(当我接近K),我们必须选择一个群集的J C'(其中j>我,因为我们需要独自离开集群已处理),但我们发现这组J可以获得远离第一组,从而打破集群的概念。
The problem with this algorithm is that near the end of the process (when i come close to k), we have to choose a cluster j in c' (where j > i because we need to leave alone the clusters already processed), but this cluster j we found can be far from cluster i, thus breaking the concept of cluster.
我的提问
有后K-means算法或K-均值变体可以达到我的要求,还是我从一开始就错了,我需要找到一个其他的聚类算法?
Is there a post K-means algorithm or a K-means variant that can meet my requirements, or am I wrong from the beginning and I need to find an other clustering algorithm?
PS:我不介意实施解决方案我自己,但是这将是巨大的,如果我可以用一个库,最好在JAVA
PS: I do not mind to implement the solution myself, but it would be great if I can use a library, and ideally in JAVA.
推荐答案
试试这个k均值变化:
初始化
- 选择
K
中心从数据集中随意,甚至更好的使用k均值++策略 - 在每个点,计算距离它最近的聚类中心,并建立一个堆这种
- 从堆画点,并将它们分配到最近的簇,除非集群已经是过多的。如果是这样,计算下一个最近的聚类中心,重新插入堆
- choose
k
centers from the dataset at random, or even better using kmeans++ strategy - for each point, compute the distance to its nearest cluster center, and build a heap for this
- draw points from the heap, and assign them to the nearest cluster, unless the cluster is already overfull. If so, compute the next nearest cluster center and reinsert into the heap
在最后,你应该有一个paritioning满足你对+的要求-1每个群集对象(请确认最后几簇也有正确的数量。相同数量的第一个 M
集群应该有 CEIL
的对象,其余的完全地板
的对象。)请注意,使用堆确保集群保持凸:如果他们不再是凸的,也将是一个更好的交换人选
In the end, you should have a paritioning that satisfies your requirements of the +-1 same number of objects per cluster (make sure the last few clusters also have the right number. The first m
clusters should have ceil
objects, the remainder exactly floor
objects.)Note that using a heap ensures the clusters remain convex: if they were no longer convex, there would have been a better swap candidate.
迭代步
要件:用交换建议(对象,将preFER是在一个不同的簇)
Requisites: a list for each cluster with "swap proposals" (objects that would prefer to be in a different cluster).
电子邮件步:计算更新聚类中心作为常规k均值
E step: compute the updated cluster centers as in regular k-means
M 步:通过迭代所有点(或者只有一个,或都在一个批次)
M step: Iterating through all points (either just one, or all in one batch)
计算最近的聚类中心对象/,比目前的集群接近所有的集群中心。如果它是一个不同的簇:
Compute nearest cluster center to object / all cluster centers that are closer than the current clusters. If it is a different cluster:
- 如果其他群集比当前集群较小,只是将其移动到新的集群
- 如果有来自其它簇(或以较低的距离的任何簇)的交换建议,交换两个元件群集分配(如果有一个以上的报价,选择一个具有最大改善)LI>
- 否则,表明交换提议为其他集群
的簇大小保持不变(+ - 所述细胞/楼层差),一个对象只从一个簇,只要它导致估计的改进移动到另一个。因此,它应在某一点收敛像K-手段。这可能是一个有点慢(即更多的迭代),虽然。
The cluster sizes remain invariant (+- the ceil/floor difference), an objects are only moved from one cluster to another as long as it results in an improvement of the estimation. It should therefore converge at some point like k-means. It might be a bit slower (i.e. more iterations) though.
我不知道这是否已经公布或者实施过。这正是我会尝试(如果我想尝试K-均值。还有更好的聚类算法。)
I do not know if this has been published or implemented before. It's just what I would try (if I would try k-means. there are much better clustering algorithms.)
这篇关于在k个簇大小相等的组n个点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!