Is there a fast way to estimate a simple regression (a regression line with only an intercept and slope)?

Problem description

This question relates to a machine learning feature selection procedure.

I have a large matrix of features - columns are the features of the subjects (rows):

set.seed(1)
features.mat <- matrix(rnorm(10*100),ncol=100)
colnames(features.mat) <- paste("F",1:100,sep="")
rownames(features.mat) <- paste("S",1:10,sep="")

The response was measured for each subject (S) under different conditions (C) and therefore looks like this:

response.df <-
  data.frame(S = c(sapply(1:10, function(x) rep(paste("S", x, sep = ""), 100))),
             C = rep(paste("C", 1:100, sep = ""), 10),
             response = rnorm(1000), stringsAsFactors = F)

So I match the subjects in response.df to the rows of features.mat:

match.idx <- match(response.df$S, rownames(features.mat))

I'm looking for a fast way to compute the univariate regression of the response on each feature.

Is there anything faster than this?

fun <- function(f){
  ## regress the response on one feature, matched by subject
  fit <- lm(response.df$response ~ features.mat[match.idx, f])
  beta <- coef(summary(fit))
  data.frame(feature = colnames(features.mat)[f], effect = beta[2, 1],
             p.val = beta[2, 4], stringsAsFactors = F)
  }

res <- do.call(rbind, lapply(1:ncol(features.mat), fun))

I am interested in a marginal speed-up, i.e., in methods other than parallel computing via mclapply or mclapply2.

Recommended answer

I will provide a lightweight toy routine for estimating a simple regression model y ~ x, i.e., a regression line with only an intercept and a slope. As will be seen, this is 36 times faster than lm + summary.lm.

## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting an intercept-free model: yc ~ xc + 0
  xty <- c(crossprod(xc, yc))
  xtx <- c(crossprod(xc))
  slope <- xty / xtx
  rc <- yc - xc * slope
  ## Pearson estimate of residual standard error
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  slope_se <- sqrt(sigma2 / xtx)
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

Let's test it:

simplelm(x, y)

#    Estimate   Std. Error      t value     Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

On the other hand, lm + summary.lm gives:

coef(summary(lm(y ~ x)))

#             Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 0.1154549 0.01373051  8.408633 5.350248e-11
#x           0.2656737 0.02279663 11.654079 1.337380e-15

So the results match. If you need R-squared and adjusted R-squared, they can easily be computed as well, for example as sketched below.
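Here is a minimal sketch of that computation (my addition, not code from the original answer); it reuses the same centred quantities that simplelm already forms, with adjusted R-squared based on n - 2 residual degrees of freedom:

## hypothetical helper: R-squared and adjusted R-squared for y ~ x
rsq <- function (x, y) {
  n <- length(x)
  yc <- y - sum(y) / n
  xc <- x - sum(x) / n
  slope <- c(crossprod(xc, yc)) / c(crossprod(xc))
  rc <- yc - xc * slope
  RSS <- c(crossprod(rc))  ## residual sum of squares
  TSS <- c(crossprod(yc))  ## total sum of squares
  r2 <- 1 - RSS / TSS
  c("R-squared" = r2, "Adj. R-squared" = 1 - (1 - r2) * (n - 1) / (n - 2))
  }

rsq(x, y)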

Let's have a benchmark:

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))

#Unit: microseconds
#                     expr      min       lq       mean   median       uq
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00
#           simplelm(x, y)   235.08   265.72   485.4076   288.20   319.46
#      max neval cld
# 114662.2   100   b
#   3409.6   100  a

Holy! We have a 36-fold boost!
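To connect this back to the question, here is a hedged sketch (my addition) of dropping simplelm into the question's per-feature loop; it assumes features.mat, response.df and match.idx from the question above:

## hypothetical glue code: one simplelm fit per feature column
y <- response.df$response
res <- do.call(rbind, lapply(seq_len(ncol(features.mat)), function (f) {
  est <- simplelm(features.mat[match.idx, f], y)
  data.frame(feature = colnames(features.mat)[f],
             effect = unname(est["Estimate"]),
             p.val = unname(est["Pr(>|t|)"]),
             stringsAsFactors = FALSE)
  }))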

simplelm is based on solving the normal equations via Cholesky factorization, but since the model is so simple, no actual matrix computation is involved. If we need regression with multiple covariates, we can use the lm.chol function defined in my answer to another question.

The normal equations can also be solved by LU factorization. I will not touch on this here, but if you are interested, see: Solving normal equation gives different coefficients from using lm?.
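As a quick illustration (my sketch, not the linked answer's code): for a general square system, R's solve() calls an LU-based LAPACK routine, so the 2 x 2 normal equations can be solved directly:

## solving the normal equations (X'X) b = X'y; solve() uses an
## LU-based LAPACK solver for general square systems
X <- cbind(intercept = 1, x = x)
b <- solve(crossprod(X), crossprod(X, y))
b  ## intercept and slope, matching coef(lm(y ~ x))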

simplelm is an extension of the fastsim function in my answer Monte Carlo simulation of correlation between two Brownian motion (continuous random walk). An alternative way is based on cor.test. It is also much faster than lm + summary.lm, but as shown in that answer, it is still slower than my proposal above.
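A minimal sketch of that cor.test route (my reconstruction, not the linked answer's exact code): for simple regression, the t-test on the Pearson correlation is equivalent to the t-test on the slope, and the slope is just the correlation rescaled by the two standard deviations:

## slope and p-value for y ~ x recovered from a correlation test
ct <- cor.test(x, y)
slope <- unname(ct$estimate) * sd(y) / sd(x)
c("Estimate" = slope, "Pr(>|t|)" = ct$p.value)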

A QR-based method is also possible, in which case we want to use .lm.fit, a lightweight wrapper of qr.default, qr.coef, qr.fitted and qr.resid at C level. Here is how we can add this option to our simplelm:

## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y, QR = FALSE) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting intercept free model: yc ~ xc + 0
  if (QR) {
    fit <- .lm.fit(matrix(xc), yc)
    slope <- fit$coefficients
    rc <- fit$residuals
    } else {
    xty <- c(crossprod(xc, yc))
    xtx <- c(crossprod(xc))
    slope <- xty / xtx
    rc <- yc - xc * slope
    }
  ## Pearson estimate of residual standard error
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  if (QR) {
    slope_se <- sqrt(sigma2) / abs(fit$qr[1])
    } else {
    slope_se <- sqrt(sigma2 / xtx)
    }
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

For our toy data, the QR method and the Cholesky method give the same result:

set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

simplelm(x, y, TRUE)

#    Estimate   Std. Error      t value     Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

simplelm(x, y, FALSE)

#    Estimate   Std. Error      t value     Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

The QR method is known to be 2 to 3 times slower than the Cholesky method (read my answer Why the built-in lm function is so slow in R? for a detailed explanation). Here is a quick check:

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))

#Unit: microseconds
#                 expr    min     lq      mean median     uq     max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92   100   b
#       simplelm(x, y) 238.32 292.02  441.9292 310.44 319.32 3515.08   100  a

So indeed, 908 / 310 = 2.93.

If we move on to GLMs, there is also a fast, lightweight version based on glm.fit. You can read my answer R loop help: leave out one observation and run glm one variable at a time and use the function f defined there. At the moment f is customized to logistic regression, but we can generalize it to other responses easily. A hedged sketch of the idea follows.
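The following is my own illustration of the glm.fit route, not the f from that answer; the standard-error recovery mirrors what summary.glm does internally (unscaled covariance from the weighted QR, with dispersion fixed at 1 for the binomial family):

## bare-bones logistic regression of a binary response on one feature
set.seed(0)
x <- runif(100)
ybin <- rbinom(100, 1, plogis(0.1 + 0.3 * x))
fit <- glm.fit(x = cbind(1, x), y = ybin, family = binomial())
beta <- fit$coefficients
## standard errors from the weighted QR, as summary.glm computes them
se <- sqrt(diag(chol2inv(fit$qr$qr[1:2, 1:2, drop = FALSE])))
z <- beta / se
cbind("Estimate" = beta, "Std. Error" = se, "Pr(>|z|)" = 2 * pnorm(-abs(z)))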
