Problem description
I am trying to cluster a matrix (size: 20057x2):
T = clusterdata(X,cutoff);
But I get this error:
??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.
Error in ==> pdist at 211
Y = pdistmex(X',dist,additionalArg);
Error in ==> linkage at 139
Z = linkagemex(Y,method,pdistArg);
Error in ==> clusterdata at 88
Z = linkage(X,linkageargs{1},pdistargs);
Error in ==> kmeansTest at 2
T = clusterdata(X,1);
Can someone help me? I have 4 GB of RAM, but I think the problem lies somewhere else.
Recommended answer
As others have mentioned, hierarchical clustering needs to compute the pairwise distance matrix, which in your case is too big to fit in memory.
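To put a rough number on "too big": pdist returns one double per pair of rows, so the result grows quadratically with the number of rows. The back-of-the-envelope check below uses the 20057 rows from the question. It counts only the distance vector itself; any working copies made inside linkage are not included, and how much MATLAB can allocate in one piece depends on the platform, so treat it as an estimate rather than an exact requirement.

```matlab
n = 20057;                    % number of rows in X, taken from the question
numPairs = n*(n-1)/2;         % length of the distance vector returned by pdist
bytesNeeded = numPairs * 8;   % double precision: 8 bytes per entry
fprintf('%d pairs, about %.2f GB\n', numPairs, bytesNeeded/1e9);
```

That is roughly 1.6 GB for the distances alone. By comparison, a 1000-row subset needs only 499500 entries, about 4 MB.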
Try using the K-Means algorithm instead:
numClusters = 4;
T = kmeans(X, numClusters);
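K-Means avoids the pairwise distance matrix: it only needs distances from each point to the numClusters centroids. A slightly fuller sketch of the call above is shown here, assuming random stand-in data in place of the real X and keeping the arbitrary choice of 4 clusters. The 'Replicates' option reruns the algorithm from several random starts and keeps the best result, which helps because K-Means depends on its initialization; the value 5 is only an illustrative choice.

```matlab
X = rand(20057, 2);      % stand-in for the real data (assumption)
numClusters = 4;         % arbitrary; K-Means needs this chosen up front

% T       : cluster index for each row of X
% centers : one centroid per row, numClusters-by-2
[T, centers] = kmeans(X, numClusters, 'Replicates', 5);

% plot the points colored by cluster, with the centroids on top
gscatter(X(:,1), X(:,2), T), hold on
plot(centers(:,1), centers(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2)
hold off
```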
Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, compute the cluster centers as the mean (or median) of each cluster. Finally, for each instance that was not selected in the subset, compute its distance to each of the centroids and assign it to the closest one.
Here is some sample code to illustrate the idea above:
%# random data
X = rand(25000, 2);
%# pick a subset
SUBSET_SIZE = 1000; %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);
%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3)); %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) ); %# number of clusters found
%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])
%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight
%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(C, data(:,i), [], @mean);
end
%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);
%#clustIDX( ind(1:SUBSET_SIZE) ) = C;
%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight
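One thing to be aware of with this approach: nearest-centroid assignment can give a subset point a different label than the hierarchical clustering did, which is what the commented-out line in the code above would undo. The self-contained sketch below reruns a smaller version of the pipeline and measures how often the two labelings agree on the subset. The sizes are arbitrary, and it cuts the tree with 'maxclust' instead of a distance cutoff purely to keep the example short; neither choice comes from the code above.

```matlab
X = rand(2000, 2);                      % random data, smaller than above
ind = randperm(size(X,1));
sub = ind(1:200).';                     % row indices of the random subset

% hierarchical clustering on the subset only
Z = linkage(X(sub,:), 'ward');          % Euclidean distance by default
C = cluster(Z, 'maxclust', 4);          % assumption: ask for 4 clusters
K = max(C);

% centroid of each cluster found in the subset
centers = zeros(K, size(X,2));
for k = 1:K
    centers(k,:) = mean(X(sub(C==k), :), 1);
end

% squared distance of every point to every centroid, then nearest centroid
D = zeros(size(X,1), K);
for k = 1:K
    D(:,k) = sum(bsxfun(@minus, X, centers(k,:)).^2, 2);
end
[~, clustIDX] = min(D, [], 2);

% fraction of subset points whose nearest centroid matches their original label
agreement = mean(clustIDX(sub) == C);
fprintf('agreement on the subset: %.1f%%\n', 100*agreement);
```

If the agreement is low, the clusters found on the subset are probably not compact around their centroids, and the nearest-centroid rule may be a poor way to extend them to the rest of the data.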