Problem description
I have three datasets:
response - a 5 (samples) x 10 (dependent variables) matrix
predictors - a 5 (samples) x 2 (independent variables) matrix
test_set - a 10 (samples) x 2 (independent variables, as defined in predictors) matrix
response <- matrix(sample.int(15, size = 5*10, replace = TRUE), nrow = 5, ncol = 10)
colnames(response) <- c("1_DV","2_DV","3_DV","4_DV","5_DV","6_DV","7_DV","8_DV","9_DV","10_DV")
predictors <- matrix(sample.int(15, size = 5*2, replace = TRUE), nrow = 5, ncol = 2)
colnames(predictors) <- c("1_IV","2_IV")
test_set <- matrix(sample.int(15, size = 10*2, replace = TRUE), nrow = 10, ncol = 2)
colnames(test_set) <- c("1_IV","2_IV")
I'm fitting a multivariate linear model on a training set that combines the response and predictor sets, and I would like to use this model to make predictions for the test set:
training_dataframe <- data.frame(predictors, response)
fit <- lm(response ~ predictors, data = training_dataframe)
predictions <- predict(fit, data.frame(test_set))
However, the predictions look really odd:
predictions
First off, the matrix dimensions are 5 x 10, which is the number of samples in the response variable by the number of DVs.
I'm not very skilled with this type of analysis in R, but shouldn't I be getting a 10 x 10 matrix, so that I have predictions for each row in my test_set?
Any help with this issue would be greatly appreciated, Martin
Recommended answer
You are stepping into a poorly supported part of R. The model class you have is "mlm", i.e., "multiple linear models", which is not the standard "lm" class. You get it when you have several (independent) response variables for a common set of covariates / predictors. Although the lm() function can fit such a model, the predict method is poor for the "mlm" class. If you look at methods(predict), you will see a predict.mlm*. Normally, for a linear model with the "lm" class, predict.lm is called when you call predict; but for the "mlm" class, predict.mlm* is called.
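As a quick check of the dispatch described above, the sketch below fits a two-response model on the built-in trees data and inspects its class. The asterisk printed by methods(predict) marks a method that is registered but not exported, so this sketch retrieves it with getS3method(); the variable name predict_mlm is only an illustrative choice.

```r
# A model with two response columns gets the class c("mlm", "lm")
fit <- lm(cbind(Girth, Height) ~ Volume, data = trees)
class(fit)
# [1] "mlm" "lm"

# "mlm" comes first, so predict() dispatches to predict.mlm rather than predict.lm.
# The method is not exported, hence getS3method() instead of typing its name.
predict_mlm <- getS3method("predict", "mlm")
is.function(predict_mlm)

# Inspect which arguments the method actually accepts
names(formals(predict_mlm))
```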
predict.mlm* is too primitive. It does not allow se.fit, i.e., it cannot produce prediction errors, confidence / prediction intervals, etc., although this is possible in theory. It can only compute the prediction mean. If so, why would we want to use predict.mlm* at all? The prediction mean can be obtained by a trivial matrix-matrix multiplication (in the standard "lm" class this is a matrix-vector multiplication), so we can do it on our own.
Consider this small example.
set.seed(0)
## 2 responses with 10 observations each
response <- matrix(rnorm(20), 10, 2)
## 3 covariates with 10 observations each
predictors <- matrix(rnorm(30), 10, 3)
fit <- lm(response ~ predictors)
class(fit)
# [1] "mlm" "lm"
beta <- coef(fit)
# [,1] [,2]
#(Intercept) 0.5773235 -0.4752326
#predictors1 -0.9942677 0.6759778
#predictors2 -1.3306272 0.8322564
#predictors3 -0.5533336 0.6218942
Now set up the prediction data:
# 2 new observations for 3 covariates
test_set <- matrix(rnorm(6), 2, 3)
We first need to pad an intercept column:
Xp <- cbind(1, test_set)
Then we do this matrix multiplication:
pred <- Xp %*% beta
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Perhaps you have noticed that I did not even use a data frame here. Indeed, it is unnecessary, as everything is already in matrix form. For R wizards, using lm.fit or even qr.solve may be more straightforward.
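For readers curious about that lower-level route, here is a minimal sketch assuming the same simulated matrices as in the small example. It estimates the coefficient matrix with lm.fit and with qr.solve, then predicts by matrix multiplication. The names X, beta_fit, beta_qr, pred_fit and pred_qr are illustrative, not part of the original answer.

```r
set.seed(0)
response   <- matrix(rnorm(20), 10, 2)   # 2 responses, 10 observations
predictors <- matrix(rnorm(30), 10, 3)   # 3 covariates, 10 observations
test_set   <- matrix(rnorm(6), 2, 3)     # 2 new observations

# Both functions work on a design matrix, so the intercept column is added by hand
X  <- cbind(1, predictors)
Xp <- cbind(1, test_set)

# lm.fit() is the fitting routine behind lm(); with a matrix response
# it returns one column of coefficients per response
beta_fit <- lm.fit(X, response)$coefficients

# qr.solve() gives the least-squares solution of X %*% beta = response
beta_qr <- qr.solve(X, response)

# Prediction means are one matrix product away
pred_fit <- Xp %*% beta_fit
pred_qr  <- Xp %*% beta_qr

# The two routes should agree up to numerical tolerance
all.equal(unname(pred_fit), unname(pred_qr))
```

Neither function handles formulas or missing values for you, which is the price of skipping lm().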
But for a complete answer, I must demonstrate how to use predict.mlm to get the desired result.
## still using previous matrices
training_dataframe <- data.frame(response = I(response), predictors = I(predictors))
fit <- lm(response ~ predictors, data = training_dataframe)
newdat <- data.frame(predictors = I(test_set))
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
Note the I() when I use data.frame(). This is a must when we want to obtain a data frame of matrices. You can compare the difference between:
str(data.frame(response = I(response), predictors = I(predictors)))
#'data.frame': 10 obs. of 2 variables:
# $ response : AsIs [1:10, 1:2] 1.262954.... -0.32623.... 1.329799.... 1.272429.... 0.414641.... ...
# $ predictors: AsIs [1:10, 1:3] -0.22426.... 0.377395.... 0.133336.... 0.804189.... -0.05710.... ...
str(data.frame(response = response, predictors = predictors))
#'data.frame': 10 obs. of 5 variables:
# $ response.1 : num 1.263 -0.326 1.33 1.272 0.415 ...
# $ response.2 : num 0.764 -0.799 -1.148 -0.289 -0.299 ...
# $ predictors.1: num -0.2243 0.3774 0.1333 0.8042 -0.0571 ...
# $ predictors.2: num -0.236 -0.543 -0.433 -0.649 0.727 ...
# $ predictors.3: num 1.758 0.561 -0.453 -0.832 -1.167 ...
Without I() to protect the matrix input, the data are messy. Remarkably, this does not cause a problem for lm, but predict.mlm will have a hard time obtaining the correct matrix for prediction if you don't use I().
That said, I would recommend using a "list" instead of a "data frame" in this case. The data argument of lm, as well as the newdata argument of predict, accepts list input. A "list" is a more general structure than a data frame and can hold any data structure without difficulty. We can do:
## still using previous matrices
training_list <- list(response = response, predictors = predictors)
fit <- lm(response ~ predictors, data = training_list)
newdat <- list(predictors = test_set)
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
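Applied to the data in the question, the same list approach might look like the sketch below. The seed and the name bad are assumptions added for reproducibility and illustration. The sketch also reproduces the 5 x 10 result from the question: a likely explanation is that data.frame(test_set) contains no variable called predictors, so predict falls back to the 5-row training matrix found in the formula's environment.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
response   <- matrix(sample.int(15, size = 5 * 10, replace = TRUE), nrow = 5, ncol = 10)
predictors <- matrix(sample.int(15, size = 5 * 2, replace = TRUE), nrow = 5, ncol = 2)
test_set   <- matrix(sample.int(15, size = 10 * 2, replace = TRUE), nrow = 10, ncol = 2)

fit <- lm(response ~ predictors,
          data = list(response = response, predictors = predictors))

# newdata must contain a variable named exactly as in the formula: `predictors`
pred <- predict(fit, newdata = list(predictors = test_set))
dim(pred)
# [1] 10 10

# The call from the question: newdata has columns X1 and X2 but no `predictors`,
# so the training matrix is silently used instead
bad <- predict(fit, newdata = data.frame(test_set))
dim(bad)
# [1]  5 10
```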
Finally, I should stress that it is always safer to use the formula interface rather than the matrix interface. I will use R's built-in dataset trees as a reproducible example.
fit <- lm(cbind(Girth, Height) ~ Volume, data = trees)
## use the first two rows as prediction dataset
predict(fit, newdata = trees[1:2, ])
# Girth Height
#1 9.579568 71.39192
#2 9.579568 71.39192
Perhaps you still remember my saying that predict.mlm* is too primitive to support se.fit. This is the chance to test it.
predict(fit, newdata = trees[1:2, ], se.fit = TRUE)
#Error in predict.mlm(fit, newdata = trees[1:2, ], se.fit = TRUE) :
# the 'se.fit' argument is not yet implemented for "mlm" objects
Oops... How about confidence / prediction intervals (without the ability to compute standard errors, it is actually impossible to produce those intervals)? Well, predict.mlm* will just ignore the request.
predict(fit, newdata = trees[1:2, ], interval = "confidence")
# Girth Height
#1 9.579568 71.39192
#2 9.579568 71.39192
So this is very different from predict.lm.
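One possible workaround, sketched here as an assumption rather than as part of the original answer, is to fit each response separately. An "mlm" fit is a set of ordinary least-squares regressions sharing the same predictors, so each per-response "lm" fit gives the same point predictions while predict.lm supplies the intervals. The names responses, newdat, fit_mlm and intervals are illustrative.

```r
responses <- c("Girth", "Height")
newdat    <- trees[1:2, ]

# The multi-response fit, kept only to compare point predictions
fit_mlm <- lm(cbind(Girth, Height) ~ Volume, data = trees)

# One ordinary "lm" per response; predict.lm supports interval and se.fit
intervals <- lapply(responses, function(y) {
  fit_y <- lm(reformulate("Volume", response = y), data = trees)
  predict(fit_y, newdata = newdat, interval = "confidence")
})
names(intervals) <- responses

# Each element is a matrix with columns fit, lwr and upr
colnames(intervals$Girth)

# The "fit" column should match the corresponding column from predict.mlm
all.equal(unname(intervals$Girth[, "fit"]),
          unname(predict(fit_mlm, newdata = newdat)[, "Girth"]))
```

This treats the responses one at a time, so it gives per-response intervals only and says nothing about joint regions across responses.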