Problem description
I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. With a classical clustering approach, however, I would have to build a 70,000 x 70,000 distance matrix holding the distance between every pair of numbers in the dataset, and that does not fit in memory. Is there a smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and biganalytics packages in R, but still could not fit the data into memory.
Recommended answer
You can use kmeans, which is normally suitable for this amount of data, to compute a large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of these centers. That way the distance matrix will be much smaller.
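To make the size difference concrete, here is a rough back-of-the-envelope calculation in R. It assumes distances are stored as 8-byte doubles and that a `dist`-style object keeps only the lower triangle; the variable names are illustrative only. Under those assumptions the full 70,000 x 70,000 matrix needs roughly 36.5 GiB, the lower triangle alone about 18 GiB, while the distances between 1000 centers take under 4 MiB.

```r
# Rough memory estimate; assumes 8-byte doubles (an assumption, not measured)
n <- 70000                 # number of observations in the question
k <- 1000                  # number of k-means centers suggested in the answer
bytes_per_double <- 8

# Full n x n matrix, in GiB
full_matrix_gib <- n^2 * bytes_per_double / 2^30

# Lower triangle only (what a dist object stores), in GiB
lower_tri_gib <- n * (n - 1) / 2 * bytes_per_double / 2^30

# Lower triangle for the k centers, in MiB
centers_mib <- k * (k - 1) / 2 * bytes_per_double / 2^20

print(c(full_matrix_gib = full_matrix_gib,
        lower_tri_gib   = lower_tri_gib,
        centers_mib     = centers_mib))
```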
## Example
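# Data: two Gaussian clouds of 35,000 points each (70,000 rows, 2 columns)
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# CAH (hierarchical agglomerative clustering) without kmeans:
# does not necessarily work, since all 70,000 rows are clustered directly
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# CAH with kmeans: runs quickly, since only the 1000 centers are clustered
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")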
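The example above stops at the tree built on the centers. If the goal is a cluster label for every original value, one possible way to finish the job is sketched below using only base R (`kmeans`, `dist`, `hclust`, `cutree`) rather than FactoMineR. This is a sketch under assumptions, not the answer's own code: the uniform toy data `d`, the Ward linkage, and the choice of `n_groups <- 5` final clusters are all hypothetical and should be replaced by the real data and a cut chosen from the dendrogram.

```r
set.seed(42)

# Toy 1-D data standing in for the 70,000 distances in [0, 50] (assumed shape)
d <- runif(70000, min = 0, max = 50)

# Stage 1: compress the data into k_centers k-means centers
k_centers <- 1000
km <- kmeans(d, centers = k_centers, iter.max = 20)

# Stage 2: hierarchical clustering on the centers only (1000 x 1000 distances)
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree into a hypothetical number of final groups
n_groups <- 5
center_group <- cutree(hc, k = n_groups)

# Map back: each point inherits the group of the center it was assigned to
point_group <- unname(center_group[km$cluster])

# Size of each final group
print(table(point_group))
```

Each original observation ends up with the hierarchical group of its nearest k-means center, so no distance matrix larger than 1000 x 1000 is ever built.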