Problem description
I need to apply lm() to an expanding subset of my data frame dat, while making a prediction for the next observation. For example, I am doing:
fit model            predict
------------------   ----------------
dat[1:3, ]           dat[4, ]
dat[1:4, ]           dat[5, ]
.                    .
.                    .
dat[-nrow(dat), ]    dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example, to predict the last row, I do:
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
- From fit, I'd need the summary(fit)$adj.r.squared value;
- From predict.fit, I'd need the predict.fit$fit value.
Thanks.
Recommended answer
(Efficient) solution
Here is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note that I have done several things inside the bundle function:
- I have used the subset argument to select the rows to fit;
- I have used model = FALSE so that the model frame is not stored in the fitted object, which saves memory.
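To illustrate the two choices above, here is a minimal sketch on simulated data. The seed, the cutoff x = 10 and the names fit_subset and fit_sliced are assumptions made only for this example. It checks that fitting with subset = 1:x gives the same coefficients as fitting on dat[1:x, ], and that model = FALSE leaves the model frame out of the fitted object.

```r
set.seed(1)  # assumption: arbitrary seed, used only to make the sketch reproducible
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))
x <- 10      # assumption: arbitrary cutoff row for the comparison

## fit on the first x rows through the subset argument, without storing the model frame
fit_subset <- lm(log(clicks) ~ log(v1) + log(v12), data = dat,
                 subset = 1:x, model = FALSE)
## fit on an explicitly sliced data frame, with the default model = TRUE
fit_sliced <- lm(log(clicks) ~ log(v1) + log(v12), data = dat[1:x, ])

## both routes should estimate the same coefficients
same_coef <- isTRUE(all.equal(coef(fit_subset), coef(fit_sliced)))

## the model frame is stored only when model = TRUE
has_model_frame <- c(subset = !is.null(fit_subset[["model"]]),
                     sliced = !is.null(fit_sliced[["model"]]))
print(same_coef)
print(has_model_frame)
```

Prediction with newdata does not need the stored model frame, so dropping it costs nothing here.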
Overall, there is no explicit loop, but sapply is used.
- Fitting starts from p, the minimum number of observations required to fit a model with p coefficients;
- Fitting terminates at nrow(dat) - 1, as we need at least the final row for prediction.
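Since the question asks for a table, the following sketch reruns the same rolling scheme and attaches two bookkeeping columns showing where each fit stops and which row it predicts. The seed and the names result_df, fit_up_to and predicted_row are hypothetical and not part of the original answer.

```r
set.seed(123)  # assumption: arbitrary seed, used only to make the sketch reproducible
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))

p <- 3              ## number of coefficients, so the first fit uses rows 1:p
n <- nrow(dat) - 1  ## the last fit uses rows 1:n and predicts row n + 1

bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}

result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")

## hypothetical bookkeeping columns: last fitted row and the row being predicted
result_df <- data.frame(fit_up_to = p:n, predicted_row = (p:n) + 1, result)
print(head(result_df))
```

With 30 rows and p = 3 this gives nrow(dat) - p = 27 rows, one per fitted subset.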
Test
Example data (with 30 "observations"):
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying the code above gives result (27 rows in total; the output is truncated to the first 5 rows):
          adj.r2 prediction        se
[1,]         NaN   3.881068       NaN
[2,] 0.106592619   3.676821 0.7517040
[3,] 0.545993989   3.892931 0.2758347
[4,] 0.622612495   3.766101 0.1508270
[5,] 0.180462206   3.996344 0.2059014
The first column is the adjusted R-squared of the fitted model, the second column is the prediction, and the third is the standard error of that prediction. The first value of adj.r2 is NaN because the first model we fit has 3 coefficients for 3 data points, so no sensible statistic is available. The same happens to se: the fit passes through all three points exactly, leaving zero residuals and zero residual degrees of freedom, so the uncertainty of the prediction cannot be estimated.