本文介绍了我很难理解OPTICS聚类算法中的订购概念的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很难理解OPTICS聚类算法中的订购概念。



如果有人能对订购和订购提供逻辑和直观的解释,我将不胜感激。还说明了以下代码中 res $ order 的功能以及可获取性图(可以通过命令 plot(res)获得)。

 库(dbscan)

set.seed(2)
n
x<-cbind(
x = runif(4,0,1)+ rnorm(n,sd = 0.1),
y = runif(4,0,1)+ rnorm(n, sd = 0.1)


图(x,col = rep(1:4,time = 100))


res res

res $ order
图(res)

res $ order给出以下输出:

绘图生成的可及性图我无法发布,因为这是我在StackExchange上遇到的第一个问题...但是如果您运行R代码,则可以轻松获得它。

解决方案

我一直在同一个问题上挣扎,经过一番研究,我认为我终于了解了它是如何工作的。



基本想法:



我现在将添加



根据我的了解,可达性距离是距离从每个点到最接近的核心点。



可达性图中谷和穗的含义



看一下伪代码,现在将完全清楚:当一个点的最终可达距离非常大时,这意味着该点与最近的核心点相距很远,因此它不属于任何群集。这就是为什么可到达性图中的峰值表示群集之间的分离……这些峰值表示异常值。实际上,离群值是指可达距离大于指定的epsilon的点(它们不在任何核心点附近)。



至于为什么山谷代表集群,那必须与集群增长的方式有关。记住DBSCAN是如何工作的,通过一连串直接连接的核心点来扩展集群...使用OPTICS相同,同时还要考虑到跟踪添加到当前集群中每个点的可达距离。考虑一下簇的形成方式,以了解山谷的含义。


I am having a hard time understanding the concept of Ordering in OPTICS Clustering algorithm.

I Would be grateful if someone gives a logical and intuitive explanation of the ordering and also explain what res$order does in the following code and what is the reahability plot(which can be obtained by the command 'plot(res)').

library(dbscan)

set.seed(2)
n <- 400

x <- cbind(
  x = runif(4, 0, 1) + rnorm(n, sd=0.1),
  y = runif(4, 0, 1) + rnorm(n, sd=0.1)
  )

plot(x, col=rep(1:4, time = 100))


res <- optics(x, eps = 10,  minPts = 10)
res

res$order
plot(res)

res$order gives the following output:

and the 'plot' produces a reachability plot which I am not able to post because this is my first question on StackExchange...but if you run the R code you can easily get it.

解决方案

I was struggling with the same issue and after some research I think I finally understood how it works.

BASIC IDEA:

I will now add the pseudocode provided by Wikipedia, commented by me to explain it a little bit:

OPTICS(DB, eps, MinPts)
for each point p of DB
   p.reachability-distance = UNDEFINED
for each unprocessed point p of DB
   N = getNeighbors(p, eps)
   mark p as processed
   output p to the ordered list          # ordered list = final result
   # if p is a core point (has at least minPts in the specified radius)
   if (core-distance(p, eps, Minpts) != UNDEFINED)
      Seeds = empty priority queue
      # update the reachability-distance for every neighbour
      update(N, p, Seeds, eps, Minpts)
      # seeds will have the neighbours wich reachability-distance was updated
      # with the selected core point
      for each next q in Seeds
         N' = getNeighbors(q, eps)
         mark q as processed
         output q to the ordered list          # ordered list = final result
         # if the neighbor is a core point, grow the cluster as DBSCAN does
         if (core-distance(q, eps, Minpts) != UNDEFINED)
            update(N', q, Seeds, eps, Minpts)


update(N, p, Seeds, eps, MinPts)
coredist = core-distance(p, eps, MinPts)
# for every neighbor
for each o in N
   if (o is not processed)
      new-reach-dist = max(coredist, dist(p,o))
      if (o.reachability-distance == UNDEFINED) // o is not in Seeds
          o.reachability-distance = new-reach-dist
          Seeds.insert(o, new-reach-dist)
      else               // o in Seeds, check for improvement
          if (new-reach-dist < o.reachability-distance)
             o.reachability-distance = new-reach-dist
             Seeds.move-up(o, new-reach-dist)

From that pseudocode, I get the following:

  • You will need an ordered list (to represent the reachability plot)
  • Clusters will grow as they do in the DBSCAN algorithm
  • The difference with DBSCAN is that now, when you get the neighborhood of a core point, for every neighbor you will have to calculate a so called reachability distance; if the neighbor hasn't one, just save the one you get and put that point next to the core one in the ordered list; if it had one already, compare the old one with the one obtained when processing the core point and go with the smaller one. If that one happens to be the new one, an update must be done in the ordered list to accommodate the point close to the core point.

REACHABILITY DISTANCE

Now, it's crucial to understand what's the reachability distance. Also, from Wikipedia, we have:

From what I get, the reachability distance is the distance from every point to its closest core point.

MEANING OF VALLEYS AND SPIKES IN REACHABILITY PLOT

If we take a look at the pseudocode, now it will be totally clear: when the final reachability distance for a point is super big, that means the point is far far away from the closest core point, so it doesn't belong to any cluster. That's why the spikes in the reachability plot denote the separation between clusters ... those spikes represent outliers. In fact, outliers are those points which reachability distance is greater than the specified epsilon (they aren't in any core point neighbourhood).

As for why valleys represent clusters, that has to be with the way the clusters grow; remember how DBSCAN works, growing a cluster following a chain of directly connected core points ... with OPTICS that's the same, considering also we keep track of the reachability distance of every point we add to the current cluster. Think about the way clusters are form to understand the meaning of the valleys.

这篇关于我很难理解OPTICS聚类算法中的订购概念的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

06-16 15:32