Problem description
I'm trying to cluster some data from the KDD Cup 1999 dataset.
The output from the file looks like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
There are 48 thousand different records in that format. I have cleaned up the data and removed the text, keeping only the numbers. The output now looks like this:
I created a comma-delimited file in Excel, saved it as a CSV file, and then created a data source from the CSV file in MATLAB. I've tried running it through the fcm toolbox in MATLAB (findcluster outputs 38 data types, which is expected with 38 columns).
However, the clusters don't look like clusters, or the toolbox isn't accepting the data and working the way I need it to.
Could anyone help me find the clusters? I'm new to MATLAB, so I don't have any experience with it, and I'm also new to clustering.
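Method:
- Choose the number of clusters (K)
- Initialize the centroids (K patterns randomly chosen from the data set)
- Assign each pattern to the cluster with the closest centroid
- Calculate the mean of each cluster to be its new centroid
- Repeat step 3 until a stopping criterion is met (no pattern moves to another cluster)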
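The steps above can be sketched roughly as follows. This is only an illustration in base MATLAB, not the code inside the fcm toolbox: the synthetic matrix `X`, the variable names `centroids` and `labels`, and the cap of 100 iterations are all assumptions made for the example.

```matlab
% Minimal K-means sketch (illustrative only; not the fcm toolbox code).
% X holds one pattern per row. The data here is made up: two separated groups.
X = [(1:10)', (1:10)'; (101:110)', (101:110)'];
K = 2;                                    % step 1: choose the number of clusters

n = size(X, 1);
perm = randperm(n);
centroids = X(perm(1:K), :);              % step 2: K random patterns as centroids
labels = zeros(n, 1);

for iter = 1:100                          % hypothetical iteration cap
    % Step 3: squared Euclidean distance from every pattern to every centroid
    D = zeros(n, K);
    for k = 1:K
        diffs = X - centroids(k, :);      % implicit expansion (R2016b or later)
        D(:, k) = sum(diffs .^ 2, 2);
    end
    [~, newLabels] = min(D, [], 2);

    % Stopping criterion: no pattern moved to another cluster
    if isequal(newLabels, labels)
        break;
    end
    labels = newLabels;

    % Step 4: the mean of each cluster becomes its new centroid
    for k = 1:K
        members = (labels == k);
        if any(members)                   % keep the old centroid if a cluster empties
            centroids(k, :) = mean(X(members, :), 1);
        end
    end
end
```

After the loop, `labels(i)` holds the cluster index of row `i`. Which group ends up as cluster 1 or 2 depends on the random initialization.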
This is what I'm trying to achieve:
This is what I get:
load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000
figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')
% Assign each record to the cluster in which it has the highest membership
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
% Plot only the first two of the 38 columns, marked by cluster
figure
line(kddcup1(index1, 1), kddcup1(index1, 2), 'linestyle',...
'none','marker', 'o','color','g');
line(kddcup1(index2,1),kddcup1(index2,2),'linestyle',...
'none','marker', 'x','color','r');
hold on
% Overlay the two cluster centers
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)
Recommended answer
Since you are new to machine learning and data mining, you shouldn't start with such an advanced problem. After all, the data you are working with was used in a competition (KDD Cup '99), so don't expect it to be easy!
Besides, the data was intended for a classification task (supervised learning), where the goal is to predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally more difficult.
This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data. Directly applying simple algorithms like K-means will generally yield poor results.
For starters, you need to normalize the attributes to the same scale: when computing the Euclidean distance as part of step 3 in your method, features with values such as 239 and 486 will dominate features with small values such as 0.05, thus distorting the result.
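As a rough illustration of one way to normalize, the sketch below z-scores each column using base MATLAB only. The matrix `X` is made up for the example (it merely borrows values like 239, 486 and 0.05 from the sample record), and the variable names are assumptions. The Statistics and Machine Learning Toolbox also provides `zscore`, if it is available to you.

```matlab
% Z-score each column so that every feature has mean 0 and unit spread.
% X is a small made-up matrix, not real KDD Cup data.
X = [239  486 0.05 0;
     235 1337 0.00 0;
     219 2032 1.00 0];

mu    = mean(X, 1);              % per-column mean
sigma = std(X, 0, 1);            % per-column standard deviation
sigma(sigma == 0) = 1;           % constant columns would otherwise divide by zero

Xnorm = (X - mu) ./ sigma;       % implicit expansion (R2016b or later)
```

The guard for zero standard deviation matters here because a record like the one shown contains many columns that may be constant. The normalized matrix, rather than the raw one, would then be what you pass to the clustering step.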
Another point to remember is that too many attributes can be a bad thing (the curse of dimensionality). Thus, you should look into feature selection or dimensionality reduction techniques.
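To give an idea of what dimensionality reduction can look like, here is a minimal sketch of principal component analysis (PCA) done with an SVD in base MATLAB. PCA is just one possible technique, chosen here for illustration. The data matrix, the variable names, and the choice of keeping 2 components are all assumptions; the Statistics and Machine Learning Toolbox offers a `pca` function as an alternative.

```matlab
% PCA sketch via SVD on centered data (base MATLAB only).
X = [2.5 2.4 0.5;
     0.5 0.7 1.9;
     2.2 2.9 0.4;
     1.9 2.2 0.8;
     3.1 3.0 0.1;
     2.3 2.7 0.6];                              % made-up data: 6 patterns, 3 features

Xc = X - mean(X, 1);                            % center each column
[~, S, V] = svd(Xc, 'econ');                    % columns of V are the principal directions

variances = diag(S) .^ 2 / (size(X, 1) - 1);    % variance along each direction
explained = variances / sum(variances);         % fraction of total variance

numComponents = 2;                              % hypothetical choice
Xreduced = Xc * V(:, 1:numComponents);          % 6-by-2 projected data
```

Looking at `explained` gives a rough sense of how many components are worth keeping. On real data you would normally normalize the attributes first, since otherwise the large-valued features dominate the principal directions as well.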
Finally, I suggest you familiarize yourself with a simpler dataset first...