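Problem description
I'd like to simulate data for a multiple linear regression with four predictors, where I am free to specify
- the overall explained variance of the model
- the magnitude of all standardized regression coefficients
- the degree to which the predictor variables are correlated with each other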
I arrived at a solution that fulfills the first two points, but it is based on the assumption that the independent variables are all uncorrelated with each other (see code below). In order to get standardized regression coefficients, I sample from population variables with mean = 0 and variance = 1.
# Specify the population variance/covariance matrix of the four predictor variables
sigma.1 <- matrix(c(1,0,0,0,
                    0,1,0,0,
                    0,0,1,0,
                    0,0,0,1),nrow=4,ncol=4)
# Specify the population means of the four predictor variables
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
  beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)
Residuals:
    Min      1Q  Median      3Q     Max
-4.0564 -0.6310 -0.0048  0.6339  3.7119
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared:  0.299,  Adjusted R-squared:  0.299
F-statistic:  5332 on 4 and 49995 DF,  p-value: < 2.2e-16
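The error variance used in the code above can be derived from the definition of R2. The following is only a short sketch; the symbols X (vector of predictors), \beta (vector of coefficients) and \sigma_\varepsilon^2 (error variance) are notation introduced here and do not appear in the original post.

```latex
\begin{align*}
R^2 &= \frac{\operatorname{Var}(X\beta)}{\operatorname{Var}(X\beta) + \sigma_\varepsilon^2}
\quad\Longrightarrow\quad
\sigma_\varepsilon^2 = \operatorname{Var}(X\beta)\,\frac{1 - R^2}{R^2}, \\
\operatorname{Var}(X\beta) &= \sum_{j=1}^{4} \beta_j^2
\quad \text{(uncorrelated predictors with unit variance)}, \\
\sigma_\varepsilon^2 &= \left(0.4^2 + 0.3^2 + 0.25^2 + 0.25^2\right) \cdot \frac{0.70}{0.30}
= 0.375 \cdot \frac{7}{3} = 0.875 .
\end{align*}
```

The square root of 0.875 is about 0.935, which is in line with the residual standard error of 0.9335 in the output above.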
Problem: If my predictor variables are correlated, i.e. if their variance/covariance matrix is specified with non-zero off-diagonal elements, the r2 and the regression coefficients differ considerably from what I want them to be, e.g. when using
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
                    0.25,1,0.25,0.25,
                    0.25,0.25,1,0.25,
                    0.25,0.25,0.25,1),nrow=4,ncol=4)
Any suggestions? Thanks!
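To see how far the R2 drifts, one can compute the population R2 that the original error variance implies once the predictors are correlated. This is only a sketch: it assumes that the variance of the linear predictor is t(beta) %*% sigma.1 %*% beta (standard variance algebra, not stated in the post), and the names var.fit and r2.implied are made up for illustration.

```r
# Equicorrelated predictors (all correlations 0.25), as in the question
sigma.1 <- matrix(0.25, nrow = 4, ncol = 4)
diag(sigma.1) <- 1
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30

# Error variance from the original code (valid for uncorrelated predictors)
var.epsilon <- sum(beta^2) * ((1 - r2) / r2)

# Assumption: variance of the linear predictor is beta' Sigma beta
var.fit <- drop(t(beta) %*% sigma.1 %*% beta)

# Population R2 that this combination actually implies
r2.implied <- var.fit / (var.fit + var.epsilon)
print(r2.implied)
```

With these values the implied R2 comes out at roughly 0.42 instead of the 0.30 target: the positive correlations inflate the variance of the linear predictor while the error variance stays the same.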
Recommended answer
After thinking about my problem a bit more, I found an answer.
The code above first samples the predictor variables with a given degree of correlation among them. Then a column for the error term is added based on the desired value of r2. Finally, a column for y is computed from all of these.
So far, the lines that create the error term are just
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
This assumes that every beta coefficient contributes 100% to the explanation of y (i.e. the independent variables are not interrelated). But if the x-variables are correlated, each beta does not (!) contribute 100% on its own, because the variables share some variability with each other. That means the variance of the error has to be bigger.
How much bigger? Just adapt the creation of the error term as follows:
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
So the degree to which the independent variables are correlated is taken into account by adding cor(sample1$V1, sample1$V2) to the sum of squared betas before it is scaled by (1 - r2)/r2. In the case of an intercorrelation of 0.25, e.g. when using
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
                    0.25,1,0.25,0.25,
                    0.25,0.25,1,0.25,
                    0.25,0.25,0.25,1),nrow=4,ncol=4)
cor(sample1$V1, sample1$V2) is approximately 0.25, and this value enters the variance of the error term.
Assuming that all intercorrelations are equal, any degree of intercorrelation among the independent variables can be specified like this, together with the true standardized regression coefficients and a desired R2. (Strictly speaking, this is a shortcut that works well for the betas used here; the exact variance of the linear predictor is t(beta) %*% sigma.1 %*% beta.)
Proof:
# Specify the population variance/covariance matrix of the four predictor variables
sigma.1 <- matrix(c(1,0.35,0.35,0.35,
                    0.35,1,0.35,0.35,
                    0.35,0.35,1,0.35,
                    0.35,0.35,0.35,1),nrow=4,ncol=4)
# Specify the population means of the four predictor variables
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
  beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)
Residuals:
     Min       1Q   Median       3Q      Max
-10.7250  -1.3696   0.0017   1.3650   9.0460
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.499554   0.002869  174.14   <2e-16 ***
V1          0.406360   0.003236  125.56   <2e-16 ***
V2          0.298892   0.003233   92.45   <2e-16 ***
V3          0.247581   0.003240   76.42   <2e-16 ***
V4          0.253510   0.003241   78.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.028 on 499995 degrees of freedom
Multiple R-squared:  0.1558,  Adjusted R-squared:  0.1557
F-statistic: 2.306e+04 on 4 and 499995 DF,  p-value: < 2.2e-16
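The fitted R2 above (0.1558) lands slightly above the 0.15 target. A likely reason is that adding a single correlation only approximates the variance of the linear predictor. The sketch below uses the full quadratic form instead, under the assumption that this variance equals t(beta) %*% sigma.1 %*% beta; the names var.fit and r2.hat, the seed, and the smaller sample size are choices made here for illustration, not part of the original answer.

```r
library(MASS)
set.seed(123)  # arbitrary seed, only for reproducibility

# Same setup as in the proof: equal correlations of 0.35
sigma.1 <- matrix(0.35, nrow = 4, ncol = 4)
diag(sigma.1) <- 1
mu.1 <- rep(0, 4)
n.obs <- 100000
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15

sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1))

# Assumption: Var(X beta) = beta' Sigma beta, which includes all cross terms
var.fit <- drop(t(beta) %*% sigma.1 %*% beta)
var.epsilon <- var.fit * ((1 - r2) / r2)
sample1$epsilon <- rnorm(n.obs, sd = sqrt(var.epsilon))

# Build y from the linear predictor plus the error term
x.mat <- as.matrix(sample1[, c("V1", "V2", "V3", "V4")])
sample1$y <- intercept + drop(x.mat %*% beta) + sample1$epsilon

# Fitted R2 should be close to the requested r2, up to sampling error
r2.hat <- summary(lm(y ~ V1 + V2 + V3 + V4, data = sample1))$r.squared
print(r2.hat)
```

Because this version uses the whole matrix sigma.1, it should also cover the case where the correlations between predictors are not all equal, and it reduces to sum(beta^2) when the predictors are uncorrelated.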