r - Outlier detection with the k-means algorithm

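I hope you can help me with my problem. I am trying to detect outliers using the kmeans algorithm. First I run the algorithm and pick as possible outliers those objects that lie far from their cluster center. Instead of the absolute distance I want to use the relative distance, i.e. the ratio of an object's absolute distance to its cluster center to the average distance of all objects in that cluster to the cluster center. The code for outlier detection based on absolute distance is the following: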

# remove species from the data to cluster
iris2 <- iris[,1:4]
kmeans.result <- kmeans(iris2, centers=3)
# cluster centers
kmeans.result$centers
# calculate distances between objects and cluster centers
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))
# pick top 5 largest distances
outliers <- order(distances, decreasing=TRUE)[1:5]
# who are outliers
print(outliers)

But how can I use the relative distance instead of the absolute distance to find the outliers?

Best answer

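You just need to calculate, for each cluster, the mean distance of its observations to the cluster center. You already have the distances, so all that is left is to average them within each cluster. The rest is a simple indexed division: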

# calculate mean distances by cluster:
m <- tapply(distances, kmeans.result$cluster, mean)

# divide each distance by the mean for its cluster:
d <- distances/(m[kmeans.result$cluster])
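
As a quick sanity check on this division, the ratios within each cluster should average to 1, because every distance is divided by the mean of its own cluster. The following is a minimal self-contained sketch, assuming the same iris setup as in the question. The set.seed call, the nstart value and the name cluster.means are my own additions for reproducibility and illustration, not part of the original code:

```r
# Assumption: seed and nstart are added only to make the run reproducible
set.seed(42)
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3, nstart = 10)

# Euclidean distance of every observation to its own cluster center
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))

# relative distance: distance divided by the mean distance within the cluster
m <- tapply(distances, kmeans.result$cluster, mean)
d <- distances / m[kmeans.result$cluster]

# each cluster's relative distances should average to 1
cluster.means <- tapply(d, kmeans.result$cluster, mean)
print(cluster.means)
```

A ratio well above 1 therefore means the observation is unusually far from its center compared with the other members of the same cluster, regardless of how spread out that cluster is.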

Your outliers:

> d[order(d, decreasing=TRUE)][1:5]
       2        3        3        1        3
2.706694 2.485078 2.462511 2.388035 2.354807
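
The labels printed above the ratios appear to be cluster numbers rather than row numbers, since d seems to take its names from m[kmeans.result$cluster]. To see which observations are flagged, order() can be applied to d directly, mirroring the code in the question. A minimal sketch, assuming the same iris setup; the seed, the nstart value and the name rel.outliers are illustrative additions, and the exact rows returned depend on the clustering run:

```r
# Assumption: seed and nstart are added only to make the run reproducible
set.seed(42)
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3, nstart = 10)

# distances to the assigned cluster center, then ratio to the cluster mean
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))
m <- tapply(distances, kmeans.result$cluster, mean)
d <- distances / m[kmeans.result$cluster]

# row indices of the 5 largest relative distances
rel.outliers <- order(d, decreasing = TRUE)[1:5]

# show row index, cluster and ratio side by side
print(data.frame(row = rel.outliers,
                 cluster = kmeans.result$cluster[rel.outliers],
                 ratio = as.numeric(d[rel.outliers])))
```

The row column can then be used to index iris2 and inspect the flagged observations.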

A similar question about r - outlier detection with the k-means algorithm can be found on Stack Overflow: https://stackoverflow.com/questions/23516358/