Question
This question relates to a machine learning feature selection procedure.
I have a large matrix of features, where the columns are the features and the rows are the subjects:
set.seed(1)
features.mat <- matrix(rnorm(10*100),ncol=100)
colnames(features.mat) <- paste("F",1:100,sep="")
rownames(features.mat) <- paste("S",1:10,sep="")
The response was measured for each subject (S) under different conditions (C), and therefore looks like this:
response.df <-
  data.frame(S = c(sapply(1:10, function(x) rep(paste("S", x, sep = ""), 100))),
             C = rep(paste("C", 1:100, sep = ""), 10),
             response = rnorm(1000), stringsAsFactors = F)
So I match the subjects in response.df:
match.idx <- match(response.df$S, rownames(features.mat))
I'm looking for a fast way to compute the univariate regression of the response on each feature. Anything faster than this?
fun <- function(f){
  fit <- lm(response.df$response ~ features.mat[match.idx, f])
  beta <- coef(summary(fit))
  data.frame(feature = colnames(features.mat)[f], effect = beta[2, 1],
             p.val = beta[2, 4], stringsAsFactors = F)
}
res <- do.call(rbind, lapply(1:ncol(features.mat), fun))
I am interested in a marginal speed-up, i.e., in methods other than parallel computing via mclapply or mclapply2.
Answer
Here is a light-weight toy routine for estimating a simple regression model, y ~ x, i.e., a regression line with only an intercept and a slope. As will be seen, this is about 36 times faster than lm + summary.lm.
## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)
## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y) {
## number of data
n <- length(x)
## centring
y0 <- sum(y) / length(y); yc <- y - y0
x0 <- sum(x) / length(x); xc <- x - x0
## fitting an intercept-free model: yc ~ xc + 0
xty <- c(crossprod(xc, yc))
xtx <- c(crossprod(xc))
slope <- xty / xtx
rc <- yc - xc * slope
## Pearson estimate of residual standard error
sigma2 <- c(crossprod(rc)) / (n - 2)
## standard error for slope
slope_se <- sqrt(sigma2 / xtx)
## t-score and p-value for slope
tscore <- slope / slope_se
pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
## return estimation summary for slope
c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
}
Let's test it:
simplelm(x, y)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
On the other hand, lm + summary.lm gives:
coef(summary(lm(y ~ x)))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1154549 0.01373051 8.408633 5.350248e-11
#x 0.2656737 0.02279663 11.654079 1.337380e-15
So the results match. If you need R-squared and adjusted R-squared, they can be computed easily, too.
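For example, here is a minimal sketch (my illustration, not part of the original routine) deriving both quantities from the same centred sums used inside simplelm:
## Illustrative sketch: R-squared and adjusted R-squared for y ~ x,
## computed from the same centred quantities as simplelm.
rsq_simplelm <- function (x, y) {
  n <- length(x)
  yc <- y - sum(y) / n
  xc <- x - sum(x) / n
  slope <- c(crossprod(xc, yc)) / c(crossprod(xc))
  rc <- yc - xc * slope                    ## residuals
  rss <- c(crossprod(rc))                  ## residual sum of squares
  tss <- c(crossprod(yc))                  ## total sum of squares
  c("R-squared" = 1 - rss / tss,
    "Adj. R-squared" = 1 - (rss / (n - 2)) / (tss / (n - 1)))
}
rsq_simplelm(x, y)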
Let's run a benchmark:
set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)
library(microbenchmark)
microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))
#Unit: microseconds
#                     expr      min       lq       mean   median       uq      max neval cld
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00 114662.2   100   b
#           simplelm(x, y)   235.08   265.72   485.4076   288.20   319.46   3409.6   100  a
Holy! We get a 36x boost!
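As a usage sketch for the question's setup (my illustration; it assumes features.mat, response.df and match.idx from the question are in scope), the per-feature loop can call simplelm directly:
## Hypothetical usage on the question's data: one simplelm call per feature
## column, collected into a feature-by-statistic matrix.
res <- t(sapply(seq_len(ncol(features.mat)), function (f)
  simplelm(features.mat[match.idx, f], response.df$response)))
rownames(res) <- colnames(features.mat)
head(res)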
simplelm is based on solving the normal equations via Cholesky factorization, but since the model is so simple, no actual matrix computation is involved. If we need regression with multiple covariates, we can use the lm.chol function defined in this answer of mine.
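For reference, a minimal sketch of that Cholesky idea for multiple covariates (an illustration of the approach, not necessarily the exact lm.chol from the linked answer):
## Illustrative Cholesky-based least squares: solve (X'X) b = X'y via
## R'R b = X'y, where R = chol(X'X) is upper triangular.
lm_chol_sketch <- function (X, y) {
  R <- chol(crossprod(X))                   ## X'X = R'R
  z <- forwardsolve(t(R), crossprod(X, y))  ## solve R'z = X'y
  backsolve(R, z)                           ## solve R b = z
}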
The normal equations can also be solved with LU factorization. I will not touch on this here, but if you are interested, see: Solving normal equation gives different coefficients from using lm?.
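A one-line sketch of that route (my illustration; for a square system, base R's solve() uses an LU-based LAPACK solver):
## Illustrative LU route: hand the normal equations to a general solver.
lm_lu_sketch <- function (X, y) solve(crossprod(X), crossprod(X, y))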
simplelm is an extension of the fastsim function from my answer Monte Carlo simulation of correlation between two Brownian motion (continuous random walk). An alternative approach is based on cor.test. It is also much faster than lm + summary.lm but, as shown in that answer, still slower than my proposal above.
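As a sketch of that cor.test route (my illustration, relying on the standard identities that in simple regression the slope's t-test equals the Pearson correlation t-test, and slope = r * sd(y) / sd(x)):
## Illustrative cor.test route: recover the slope and its test from the
## correlation test; the t statistic and p-value match the slope's.
ct <- cor.test(x, y)
slope <- unname(ct$estimate) * sd(y) / sd(x)  ## beta = r * sd(y) / sd(x)
c(Estimate = slope, "t value" = unname(ct$statistic), "Pr(>|t|)" = ct$p.value)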
A QR-based method is also possible; in that case we want to use .lm.fit, a light-weight wrapper at C level around qr.default, qr.coef, qr.fitted and qr.resid. Here is how we can add this option to simplelm:
## fast estimation of simple linear regression: y ~ x
simplelm <- function (x, y, QR = FALSE) {
## number of data
n <- length(x)
## centring
y0 <- sum(y) / length(y); yc <- y - y0
x0 <- sum(x) / length(x); xc <- x - x0
## fitting an intercept-free model: yc ~ xc + 0
if (QR) {
fit <- .lm.fit(matrix(xc), yc)
slope <- fit$coefficients
rc <- fit$residuals
} else {
xty <- c(crossprod(xc, yc))
xtx <- c(crossprod(xc))
slope <- xty / xtx
rc <- yc - xc * slope
}
## Pearson estimate of residual standard error
sigma2 <- c(crossprod(rc)) / (n - 2)
## standard error for slope
if (QR) {
slope_se <- sqrt(sigma2) / abs(fit$qr[1])
} else {
slope_se <- sqrt(sigma2 / xtx)
}
## t-score and p-value for slope
tscore <- slope / slope_se
pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
## return estimation summary for slope
c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
}
For our toy data, the QR method and the Cholesky method give the same result:
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)
simplelm(x, y, TRUE)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
simplelm(x, y, FALSE)
# Estimate Std. Error t value Pr(>|t|)
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15
The QR method is known to be 2 to 3 times slower than the Cholesky method (read my answer Why the built-in lm function is so slow in R? for a detailed explanation). Here is a quick check:
set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)
library(microbenchmark)
microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92 100 b
# simplelm(x, y) 238.32 292.02 441.9292 310.44 319.32 3515.08 100 a
So indeed, 908 / 310 = 2.93.
If we move on to GLMs, there is also a fast, light-weight version based on glm.fit. You can read my answer R loop help: leave out one observation and run glm one variable at a time and use the function f defined there. At the moment f is customized for logistic regression, but it can easily be generalized to other response types.
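As a rough sketch of the idea (mine, not the exact f from the linked answer), a univariate logistic fit via glm.fit might look like this:
## Illustrative fast logistic fit: call the workhorse glm.fit directly with
## an explicit design matrix, skipping glm's formula / model-frame overhead.
fast_logit <- function (x, y) {
  fit <- glm.fit(cbind(1, x), y, family = binomial())
  fit$coefficients
}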