Rolling regression and prediction with lm() and predict()

Question

I need to apply lm() to an expanding subset of my data frame dat, predicting the next observation each time. For example, I am doing:

fit model             predict
-----------------     ----------------
dat[1:3, ]            dat[4, ]
dat[1:4, ]            dat[5, ]
    .                     .
    .                     .
dat[-nrow(dat), ]     dat[nrow(dat), ]

I know what to do for a particular subset (related to this question: predict() and newdata - How does this work?). For example, to predict the last row, I do:

dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]

fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)

How can I do this automatically for all subsets, and potentially extract what I want into a table?

From fit, I need summary(fit)$adj.r.squared;
from predict.fit, I need the predict.fit$fit value.

Thanks.

Recommended answer

(Efficient) solution

Here is what you can do:

p <- 3  ## number of coefficients in lm(): intercept + 2 slopes
n <- nrow(dat) - 1

## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}

## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")

Note that I have done two things inside the bundle function:

I have used the subset argument to select the rows to fit, instead of copying dat[1:x, ];
I have used model = FALSE so that lm() does not store the model frame in the fitted object, which saves memory.
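
As a quick check of the two points above, here is a minimal sketch on made-up data (the data frame toy and the seed are assumptions for illustration, not the asker's data). It compares a fit using subset with a fit on manually indexed rows, and looks at whether the model frame was stored.

```r
# Hypothetical toy data with the same column names as in the question
set.seed(1)
toy <- data.frame(clicks = runif(10, 1, 100),
                  v1     = runif(10, 1, 100),
                  v12    = runif(10, 1, 100))

# The same fit written two ways: subset argument vs. manual row indexing
fit_subset <- lm(log(clicks) ~ log(v1) + log(v12), data = toy,
                 subset = 1:6, model = FALSE)
fit_manual <- lm(log(clicks) ~ log(v1) + log(v12), data = toy[1:6, ])

# Coefficients should agree between the two fits
all.equal(coef(fit_subset), coef(fit_manual))

# With model = FALSE the fitted object carries no model frame;
# the default (model = TRUE) keeps it
is.null(fit_subset$model)
is.null(fit_manual$model)
```

predict() with newdata and summary() still work on fit_subset, because neither needs the stored model frame.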

Overall, there is no explicit for loop, but sapply iterates over the subset sizes internally.

Fitting starts from p rows, the minimum number of observations required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1 rows, as we need to keep at least the final row for prediction.
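
If a data frame is more convenient than a matrix, the same idea can be wrapped in a function. The following is only a sketch: rolling_lm() is a hypothetical helper name, the toy data and seed are made up, and rows are selected with data[seq_len(x), ] rather than subset, since subset is evaluated in the data and formula environments and may not see a variable local to a function.

```r
# rolling_lm() is a hypothetical helper, not part of base R or any package
rolling_lm <- function(data, formula, p) {
  n <- nrow(data) - 1                      # largest fitting size: keep one row to predict
  one_step <- function(x) {
    # fit on the first x rows, predict row x + 1
    fit  <- lm(formula, data = data[seq_len(x), , drop = FALSE], model = FALSE)
    pred <- predict(fit, newdata = data[x + 1, , drop = FALSE], se.fit = TRUE)
    c(adj.r2     = summary(fit)$adj.r.squared,
      prediction = unname(pred$fit),
      se         = unname(pred$se.fit))
  }
  # vapply checks that every step returns exactly three numbers
  out <- vapply(p:n, one_step, c(adj.r2 = 0, prediction = 0, se = 0))
  data.frame(n_fit = p:n, predicted_row = (p:n) + 1, t(out))
}

# Usage on made-up data (arbitrary seed, for reproducibility only)
set.seed(42)
toy <- data.frame(clicks = runif(30, 1, 100),
                  v1     = runif(30, 1, 100),
                  v12    = runif(30, 1, 100))
tab <- rolling_lm(toy, log(clicks) ~ log(v1) + log(v12), p = 3)
head(tab)
```

The extra columns n_fit and predicted_row record how many rows each model was fitted on and which row it predicts, so each line of the table can be traced back to its subset.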

Test

Example data (with 30 observations):

dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))

Applying the code above gives result (27 rows in total; only the first 5 rows are shown):

            adj.r2 prediction        se
 [1,]          NaN   3.881068       NaN
 [2,]  0.106592619   3.676821 0.7517040
 [3,]  0.545993989   3.892931 0.2758347
 [4,]  0.622612495   3.766101 0.1508270
 [5,]  0.180462206   3.996344 0.2059014

The first column is the adjusted R-squared of the fitted model, the second column is the prediction (on the log(clicks) scale, since the response is log(clicks)), and the third is the standard error of that prediction. The first value of adj.r2 is NaN because the first model has 3 coefficients for 3 data points, so no sensible statistic is available. The same happens to se: the fit passes through all 3 points exactly, leaving zero residuals and no residual degrees of freedom, so the uncertainty of the prediction cannot be estimated.
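
The saturated first fit can be inspected directly. This is a small sketch on made-up data (toy3 and the seed are assumptions for illustration), fitting 3 coefficients to exactly 3 rows:

```r
# Hypothetical 3-row data set: as many observations as coefficients
set.seed(7)
toy3 <- data.frame(clicks = runif(3, 1, 100),
                   v1     = runif(3, 1, 100),
                   v12    = runif(3, 1, 100))

fit3 <- lm(log(clicks) ~ log(v1) + log(v12), data = toy3)

df.residual(fit3)             # residual degrees of freedom: 3 rows - 3 coefficients
max(abs(residuals(fit3)))     # residuals are numerically zero
summary(fit3)$adj.r.squared   # not a finite number, matching the first row above
```

With no residual degrees of freedom, the residual variance is undefined, which is why neither adj.r2 nor se is usable for that first row. Starting the rolling fit at p + 1 rows would avoid it.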
