Problem description
Assume I have hourly data for 5 categories over 10 consecutive days, created as:
library(xts)
set.seed(123)
timestamp <- seq(as.POSIXct("2016-10-01"), as.POSIXct("2016-10-10 23:59:59"), by = "hour")
data <- data.frame(cat1 = rnorm(length(timestamp), 150, 5),
                   cat2 = rnorm(length(timestamp), 130, 3),
                   cat3 = rnorm(length(timestamp), 150, 5),
                   cat4 = rnorm(length(timestamp), 100, 8),
                   cat5 = rnorm(length(timestamp), 200, 15))
data_obj <- xts(data, timestamp) # create time-series object
head(data_obj,2)
Now I cluster each day separately, using simple kmeans, to see how these categories behave with respect to each other:
daywise_data <- split.xts(data_obj, f = "days", k = 1) # split data day-wise
clus_obj <- lapply(daywise_data, function(x) { # cluster each day separately
  kmeans(t(x), 2) # transpose so that each category is one observation
})
After clustering, I view the results with
sapply(clus_obj,function(x) x$cluster) # clustering results
The result is a table of cluster labels, with one row per category and one column per day.
On visual inspection, it is clear that cat1 and cat3 always remained in the same cluster. Similarly, cat4 and cat5 are mostly in different clusters across the 10 days.
Apart from visual inspection, is there any automatic approach to gather this kind of statistics from such a clustering table?
Note: This is a dummy example. My real data frame contains 80 such categories over 100 consecutive days. An automatic summary like the one above would reduce the effort.
Recommended answer
Pair-counting cluster evaluation measures show an easy way to tackle this problem.
Rather than looking at object-to-cluster assignments, which are unstable because cluster labels are arbitrary, these measures look at whether or not two objects are in the same cluster (such a couple of objects is called a "pair").
So you could check whether or not these pairs change much over time.
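A minimal sketch of this pair-based check, assuming the cluster labels are held in a matrix with categories in rows and days in columns (the shape that the sapply call in the question returns). The names cluster_tab and co_membership and the small hand-made label table are hypothetical and serve only as illustration:

```r
# Hypothetical label table: rows = categories, columns = days.
# With the question's objects this would be
#   sapply(clus_obj, function(x) x$cluster)
cluster_tab <- rbind(
  cat1 = c(1, 2, 1, 1),
  cat2 = c(1, 2, 2, 1),
  cat3 = c(1, 2, 1, 1),
  cat4 = c(2, 1, 2, 2)
)

# Fraction of days on which each pair of categories shares a cluster.
# Only equality of labels within one day is compared, so the arbitrary
# numbering of the clusters (1 vs 2) does not influence the result.
co_membership <- function(tab) {
  n <- nrow(tab)
  same <- matrix(0, n, n, dimnames = list(rownames(tab), rownames(tab)))
  for (d in seq_len(ncol(tab))) {
    same <- same + outer(tab[, d], tab[, d], "==")
  }
  same / ncol(tab)
}

co <- co_membership(cluster_tab)
print(round(co, 2))
```

Entries close to 1 mark pairs that are almost always clustered together, and entries close to 0 mark pairs that almost never are. Something like which(co == 1 & upper.tri(co), arr.ind = TRUE) would list the pairs that never separate.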
Since k-means is randomized, you may also want to run it several times for every time slice, as different runs may return different clusterings!
You could then say, for example, that series 1 is in the same cluster as series 2 in 90% of the results, and so on.
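The repeated-run idea could be sketched as follows. This is only an illustration under stated assumptions: the simulated data (3 days, 4 categories, base R only instead of xts), the variable names, and the choice of 20 runs per day are all hypothetical and not part of the original question or answer.

```r
set.seed(123)

# Hypothetical stand-in for the question's data: 3 days x 24 hours, 4 categories.
n_days <- 3
day <- rep(seq_len(n_days), each = 24)
dat <- data.frame(
  cat1 = rnorm(length(day), 150, 5),
  cat2 = rnorm(length(day), 130, 3),
  cat3 = rnorm(length(day), 150, 5),
  cat4 = rnorm(length(day), 100, 8)
)

n_runs <- 20  # assumed number of k-means runs per day
cats <- names(dat)
together <- matrix(0, length(cats), length(cats), dimnames = list(cats, cats))

for (d in seq_len(n_days)) {
  x <- t(as.matrix(dat[day == d, ]))  # one row per category
  for (r in seq_len(n_runs)) {
    cl <- kmeans(x, centers = 2)$cluster
    # Count every pair that lands in the same cluster in this run.
    together <- together + outer(cl, cl, "==")
  }
}

# Share of all (day, run) results in which each pair shares a cluster.
share <- together / (n_days * n_runs)
print(round(share, 2))
```

Each entry of share is then the proportion of results in which two categories were clustered together, which is the kind of "90% of the results" statement described above.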