Clustering categorical and numerical data

Problem description

I have a collection of alerts that I want to group based on similarity/distance. Since the data is partly non-numeric, how can I perform clustering on this kind of data?

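  set.seed(42)  # make the sampled Host4 values and the Date1 noise reproducible
  alerts <- data.frame(
    Host1 = rep("del", 10),
    Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
    Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
    # mixing numbers with "con" coerces the whole column to character
    Host4 = c(sample(3, 8, replace = TRUE), rep("con", 2)),
    Date1 = abs(round(1:10 + rnorm(10), 2))
  )
  alerts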



   Host1  Host2  Host3 Host4 Date1
1    del    cpp    web     3  1.40
2    del    cpp    web     3  1.89
3    del    cpp    web     1  4.51
4    del    cpp    web     3  3.91
5    del   sscp    web     2  7.02
6    del   sscp apache     2  5.94
7    del   sscp apache     3  8.30
8    del portal apache     1 10.29
9    del portal    app   con  7.61
10   del portal    app   con  9.72

I expect to build clusters from this data.

Recommended answer

K-means only works for numerical (continuous) data

By definition, it minimizes squared deviations. Minimizing squared deviations only makes sense on continuous data. Any kind of one-hot encoding is only a hack; it makes the data types compatible, but it does not make the approach sensible.
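To see why one-hot encoding is only a workaround, here is a minimal sketch in base R. It assumes the Host2 values from the question and encodes them with `model.matrix`. The "centroid" that k-means would compute for these rows is just the column means of the 0/1 indicators:

```r
# Host2 values as listed in the question
host2 <- factor(c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)))

# One-hot encode: one 0/1 indicator column per level (no intercept column)
onehot <- model.matrix(~ host2 - 1)

# The mean that k-means would use as a cluster centre for these 10 rows
centroid <- colMeans(onehot)
print(centroid)
```

The resulting centre holds fractional values for every level, so it does not correspond to any host that could actually occur. The squared deviations are computed against a point that has no meaning in the original categories.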

Hierarchical clustering would work, provided you can define a meaningful distance function that quantifies how dissimilar two alerts are. But this is application dependent. We do not have your data and do not understand your problem, so we cannot solve this for you.
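As an illustration only, the following base R sketch shows what such a distance function could look like. The function `alert_distance` is hypothetical and not part of the original answer. It assumes a simple scheme in the spirit of Gower's dissimilarity: a categorical column contributes 0 on a match and 1 on a mismatch, a numeric column contributes the absolute difference divided by its range, and all columns are weighted equally. Whether this weighting is meaningful depends on the application. The data is copied from the table printed in the question, with the constant Host1 column left out.

```r
# Alert data copied from the printed table in the question (Host1 is constant, so omitted)
alerts <- data.frame(
  Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
  Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
  Host4 = c("3", "3", "1", "3", "2", "2", "3", "1", "con", "con"),
  Date1 = c(1.40, 1.89, 4.51, 3.91, 7.02, 5.94, 8.30, 10.29, 7.61, 9.72),
  stringsAsFactors = FALSE
)

# Hypothetical mixed-type distance: the average of per-column dissimilarities
alert_distance <- function(df, cat_cols, num_cols) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (col in cat_cols) {
    # 1 where the two rows disagree on this category, 0 where they agree
    total <- total + outer(df[[col]], df[[col]], FUN = "!=")
  }
  for (col in num_cols) {
    # absolute difference scaled by the column range, so it lies in [0, 1]
    x <- df[[col]]
    total <- total + abs(outer(x, x, FUN = "-")) / diff(range(x))
  }
  as.dist(total / (length(cat_cols) + length(num_cols)))
}

d <- alert_distance(alerts, c("Host2", "Host3", "Host4"), "Date1")

# Average-linkage hierarchical clustering on the custom distances
hc <- hclust(d, method = "average")

# Cut the tree into an assumed number of clusters (3 is an arbitrary choice here)
clusters <- cutree(hc, k = 3)
print(clusters)
```

Calling `plot(hc)` draws the dendrogram, which can help in choosing where to cut the tree instead of fixing `k` in advance.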
