Problem description
I need to test my Gap Statistics algorithm (which should tell me the optimum k for the dataset), and in order to do so I need to generate a big dataset that is easily clusterable, so that I know a priori the optimum number of clusters. Do you know any fast way to do it?
Recommended answer
It very much depends on what kind of dataset you expect: 1D, 2D, 3D, normally distributed, sparse, etc. And how big is "big"? Thousands, millions, billions of observations?
Anyway, my general approach to creating easy-to-identify clusters is to concatenate vectors of random numbers one after another, each with a different offset and spread:
DataSet = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];  % 3000-by-1 column of observations
Groups = [1*ones(1000,1); 2*ones(1000,1); 3*ones(1000,1)];              % matching ground-truth cluster labels
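Before feeding such data to a gap statistic, it can help to confirm that the labels line up with the values. The following is a minimal sketch, assuming the same three offsets and spreads as above; the fixed seed and the variable names `groupMeans` and `groupStds` are illustrative choices, not part of the original answer.

```matlab
% Sketch: generate the three 1-D clusters and check per-cluster statistics.
rng(42);                                  % assumed seed, only for reproducibility
n = 1000;                                 % observations per cluster
DataSet = [5*randn(n,1); 20+3*randn(n,1); 120+25*randn(n,1)];
Groups  = [1*ones(n,1); 2*ones(n,1); 3*ones(n,1)];

% accumarray groups the values by label and applies the function to each group
groupMeans = accumarray(Groups, DataSet, [], @mean);   % roughly [0; 20; 120]
groupStds  = accumarray(Groups, DataSet, [], @std);    % roughly [5; 3; 25]
disp([groupMeans groupStds]);
```

If the sample means and standard deviations are far from the offsets and spreads you asked for, the labels and the data are probably misaligned.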
This can be extended to N features by using e.g.
randn(1000,5)
or by horizontal concatenation
DataSet1 = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];    % feature 1
DataSet2 = [-100+7*randn(1000,1); 1+0.1*randn(1000,1); 20+3*randn(1000,1)]; % feature 2
DataSet = [DataSet1 DataSet2];                                              % 3000-by-2: one column per feature
, and so on.
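The same idea can be written once for any number of clusters and features. This is only a sketch under stated assumptions: the matrices `centers` and `spreads` (one row per cluster, one column per feature) are hypothetical names, filled here with the offsets and spreads used above.

```matlab
% Sketch: K Gaussian clusters in D features from K-by-D parameter matrices.
rng(1);                                   % assumed seed, only for reproducibility
centers = [0 -100; 20 1; 120 20];         % row k = mean of cluster k
spreads = [5 7; 3 0.1; 25 3];             % row k = standard deviation of cluster k
n = 1000;                                 % observations per cluster
[K, D] = size(centers);                   % K is the known "true" number of clusters

DataSet = zeros(K*n, D);
Groups  = zeros(K*n, 1);
for k = 1:K
    rows = (k-1)*n + (1:n);               % rows that belong to cluster k
    DataSet(rows,:) = repmat(centers(k,:), n, 1) + repmat(spreads(k,:), n, 1) .* randn(n, D);
    Groups(rows) = k;                     % ground-truth label
end
```

Adding a cluster or a feature then only means adding a row or a column to `centers` and `spreads`.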
randn also accepts multidimensional sizes like
randn(1000,10,3);
for looking at higher-dimensional clusters.
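One possible way to use such a 3-D array, which is an interpretation rather than something the answer spells out, is to treat each page along the third dimension as one cluster, shift it, and then stack the pages into an ordinary observations-by-features matrix. The `offsets` values below are hypothetical.

```matlab
% Sketch: use the pages of a 3-D randn array as clusters, then stack them.
rng(7);                                   % assumed seed, only for reproducibility
n = 1000; D = 10; K = 3;                  % observations, features, clusters
offsets = [0 50 100];                     % hypothetical per-cluster offsets
X3 = randn(n, D, K);                      % page k holds the n-by-D data of cluster k
for k = 1:K
    X3(:,:,k) = X3(:,:,k) + offsets(k);   % move cluster k away from the others
end

% Reorder to n-by-K-by-D so that reshape stacks the pages on top of each other
DataSet = reshape(permute(X3, [1 3 2]), n*K, D);   % (n*K)-by-D
Groups  = kron((1:K)', ones(n, 1));                % ground-truth labels 1..K
```

Rows 1 to 1000 of `DataSet` then hold cluster 1, rows 1001 to 2000 cluster 2, and so on, matching `Groups`.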
If you don't have details on what kind of datasets this is going to be applied to, you should find them out first.