本文介绍了每对观测值的马氏距离的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试计算数据集dat的每个观测值之间的马氏距离,其中每一行是一个观测值,每一列是一个变量.这样的距离定义为:

I am trying to compute the Mahalanobis distance between each observations of a dataset dat, where each row is an observation and each column is a variable. Such distance is defined as:

我编写了一个函数来执行此操作,但是我感觉它很慢.有没有更好的方法可以在R中对此进行计算?

I wrote a function that does it, but I feel like it is slow. Is there any better way to compute this in R ?

生成一些数据以测试该功能:

To generate some data to test the function:

generateData <- function(nObs, nVar){
  library(MASS)
  mvrnorm(n=nObs, rep(0,nVar), diag(nVar))
  }

这是我到目前为止编写的函数.对于我的数据(800 obs和90个变量),它们都起作用,method = "forLoop"method = "apply"分别花费大约30和33秒.

This is the function I have written so far. They both work and for my data (800 obs and 90 variables), it takes approximatively 30 and 33 seconds for the method = "forLoop" and method = "apply", respectively.

mhbd_calc2 <- function(dat, method) { #Method is either "forLoop" or "apply"
  dat <- as.matrix(na.omit(dat))
  nObs <- nrow(dat)
  mhbd <- matrix(nrow=nObs,ncol = nObs)
  cv_mat_inv = solve(var(dat))

  distMH = function(x){  #Mahalanobis distance function
    diff = dat[x[1],]-dat[x[2],]
    diff %*% cv_mat_inv %*% diff
  }

  if(method=="forLoop")
  {
    for (i in 1:nObs){
      for(j in 1:i){
        mhbd[i,j] <- distMH(c(i,j))
      }
    }
  }
  if(method=="apply")
  {
    mhbd[lower.tri(mhbd)] = apply(combn(nrow(dat),2),2, distMH)
  }
  result = sqrt(mhbd)
  colnames(result)=rownames(dat)
  rownames(result)=rownames(dat)
  return(as.dist(result))
}

注意:我尝试使用outer(),但是它甚至更慢(60秒)

NB: I tried using outer() but it was even slower (60seconds)

推荐答案

您需要一些数学知识.

  1. 对经验协方差进行Cholesky分解,然后标准化您的观察结果;
  2. 使用dist来计算变换后的观测值的欧几里得距离.
  1. Do a Cholesky factorization of empirical covariance, then standardize your observations;
  2. use dist to compute Euclidean distance on transformed observations.


dist.maha <- function (dat) {
  X <- as.matrix(na.omit(dat))  ## ensure a valid matrix
  V <- cov(X)  ## empirical covariance; positive definite
  L <- t(chol(V))  ## lower triangular factor
  stdX <- t(forwardsolve(L, t(X)))  ## standardization
  dist(stdX)  ## use `dist`
  }

示例

set.seed(0)
x <- matrix(rnorm(6 * 3), 6, 3)

dist.maha(x)
#         1        2        3        4        5
#2 2.362109
#3 1.725084 1.495655
#4 2.959946 2.715641 2.690788
#5 3.044610 1.218184 1.531026 2.717390
#6 2.740958 1.694767 2.877993 2.978265 2.794879

结果与您的mhbd_calc2一致.

这篇关于每对观测值的马氏距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-04 23:06