r - K-均值:Lloyd，Forgy，MacQueen，Hartigan-Wong

我正在使用R中的K-Means算法，我想弄清楚stats软件包中的“kmeans”函数可以使用的4种算法Lloyd，Forgy，MacQueen和Hartigan-Wong的区别。

但是，我值得注意的是，这个问题得到了足够的答案。

我只发现了一些很少的信息:
(访问http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means)

从这个描述中，劳埃德(Lloyd)，福吉(Forgy)和哈蒂根·王(Hartigan-Wong)在我看来都是一样的。最小化平方和以内或最小化欧氏距离是相同的。

如果我将对象移动到另一个群集，MacQueen会更新两个涉及的群集，这是不同的。

但是，我仍然看不到这些算法在哪些方面有所不同。

最佳答案

R 提供Lloyd算法作为kmeans()的选项；默认算法，由
Hartigan和Wong(1979)更聪明。就像MacQueen的算法(MacQueen，1967年)一样，
每当移动点时，它都会更新质心；它还可以做出明智的选择(省时)
在检查最近的群集。另一方面，劳埃德(Lloyd)的k-means算法是所有这些聚类算法中的第一个也是最简单的。

劳埃德(Lloyd，1957)的算法采用了一组观测值或案例(例如:
nxp矩阵或Reals中的点)并将它们聚类为k组。它试图最小化
集群内平方和

其中u_i是群集S_i中所有点的平均值。该算法进行如下(我将
避免为您提供详尽的符号形式):

但是，R的实现存在问题，当
考虑多个起点。我应该注意，通常谨慎考虑
几个不同的起点，因为可以保证算法收敛，但不能收敛
保证覆盖全局最佳状态。对于大型，高维的情况尤其如此
问题。我将从一个简单的示例开始(大型的，不是特别困难的)。

(这里我将粘贴一些图像，因为我们无法使用 latex 编写数学公式)

请注意，该解决方案与之前实现的解决方案非常相似，尽管
簇是任意的。更重要的是，作业并行仅需0.199秒!一定
这太好了，难以置信:使用3个处理器内核最多应该占用三分之一的处理器内核。
我们的第一个(顺序)运行时间。这有问题吗？听起来像免费午餐。没有
偶尔享用免费午餐有问题吗？

这并不总是适用于R函数，但有时我们有机会直接看一下
在代码。这是那些时代之一。我将这段代码放入文件mykmeans.R，
并手动编辑它，在不同位置插入cat()语句。这是一个聪明的方法
这可以通过使用sink()进行(尽管这似乎在Sweave中不起作用，但可以交互地起作用):

> sink("mykmeans.R")
> kmeans
> sink()

现在编辑文件，更改函数名称并添加cat()语句。注意
您还必须删除尾行:

然后，我们可以重复使用mykmeans()进行探索:

> source("mykmeans.R")
> start.kmeans <- proc.time()[3]
> ans.kmeans <- mykmeans(x, 4, nstart = 3, iter.max = 10, algorithm = "Lloyd")
JJJ statement 1: 0 elapsed time.
JJJ statement 5: 2.424 elapsed time.
JJJ statement 6: 2.425 elapsed time.
JJJ statement 7: 2.52 elapsed time.
JJJ statement 6: 2.52 elapsed time.
JJJ statement 7: 2.563 elapsed time.

现在我们开始做生意:大部分时间都花在声明5之前(我知道
当然，这就是为什么语句5是5而不是2)...
你可以继续玩

这是代码:

#######################################################################
# kmeans()

N <- 100000
x <- matrix(0, N, 2)
x[seq(1,N,by=4),] <- rnorm(N/2)
x[seq(2,N,by=4),] <- rnorm(N/2, 3, 1)
x[seq(3,N,by=4),] <- rnorm(N/2, -3, 1)
x[seq(4,N,by=4),1] <- rnorm(N/4, 2, 1)
x[seq(4,N,by=4),2] <- rnorm(N/4, -2.5, 1)
start.kmeans <- proc.time()[3]
ans.kmeans <- kmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

these <- sample(1:nrow(x), 10000)
plot(x[these,1], x[these,2], pch=".")
points(ans.kmeans$centers, pch=19, cex=2, col=1:4)

library(foreach)
library(doMC)
registerDoMC(3)
start.kmeans <- proc.time()[3]
ans.kmeans.par <- foreach(i=1:3) %dopar% {
  return(kmeans(x, 4, nstart=1, iter.max=10, algorithm="Lloyd"))
}
TSS <- sapply(ans.kmeans.par, function(a) return(sum(a$withinss)))
ans.kmeans.par <- ans.kmeans.par[[which.min(TSS)]]
ans.kmeans.par$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

sink("mykmeans.Rfake")
kmeans
sink()

source("mykmeans.R")
start.kmeans <- proc.time()[3]
ans.kmeans <- mykmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

#######################################################################
# Diving

x <- read.csv("Diving2000.csv", header=TRUE, as.is=TRUE)
library(YaleToolkit)
whatis(x)

x[1:14,c(3,6:9)]

meancol <- function(scores) {
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}
x$panelmean <- meancol(x$JScore)

x[1:14,c(3,6:9,11)]

meancol <- function(scores) {
  browser()
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}

x$panelmean <- meancol(x$JScore)

这是说明:

Number of cases: 10,787 scores from 1,541 dives (7 judges score each
dive) performed in four events at the 2000 Olympic Games in Sydney,
Australia.

Number of variables: 10.

Description: A full description and analysis is available in an
article in The American Statistician (publication details to be
announced).

Variables:

Event       Four events, men's and women's 3M and 10m.
Round       Preliminary, semifinal, and final rounds.
Diver       The name of the diver.
Country     The country of the diver.
Rank        The final rank of the diver in the event.
DiveNo      The number of the dive in sequence within round.
Difficulty  The degree of difficulty of the dive.
JScore      The score provided for the judge on this dive.
Judge       The name of the judge.
JCountry    The country of the judge.

和数据集尝试https://www.dropbox.com/s/urgzagv0a22114n/Diving2000.csv