

I was surprised to find that clara from library(cluster) allows NAs, but the function documentation says nothing about how it handles them.


  1. How does clara handle NAs?
  2. Can this somehow be used for kmeans (NAs not allowed)?

[Update] So I did find these lines of code in the clara function:

    inax <- is.na(x)
    valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
    x[inax] <- valmisdat

which replace every missing value with valmisdat. I'm not sure I understand the reason for using this formula. Any ideas? Would it be more "natural" to treat NAs in each column separately, perhaps replacing them with the column mean or median?


Although it is not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section has:

Given that the same code will be used internally by clara(), this is how I understand NAs in the data to be handled: they simply don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is, for example, used in the definition of Gower's generalised similarity coefficient.

Update: The C sources for clara() clearly indicate that this (the above) is how NAs are handled (lines 350-356 in ./src/clara.c):

    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }

