具有"gp"的GAM更顺畅:在新位置进行预测

本文介绍了具有"gp"的GAM更顺畅:在新位置进行预测的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用以下地理叠加模型

I am using the following geoadditive model

library(gamair)
library(mgcv)

data(mack)    
mack$log.net.area <- log(mack$net.area)

gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
                       s(I(b.depth^.5)) +
                       s(c.dist) +
                       s(temp.20m) +
                       offset(log.net.area),
                       data = mack, family = tw, method = "REML")

我如何使用它来预测在没有协变量数据的新位置(lon/lat)中的egg.count的值，例如在kriging中?

How can I use it to predict the value of egg.count at new locations (lon/lat) where I don't have covariate data, as in kriging?

例如说我想在这些新位置预测egg.count

For example say I want to predict egg.count at these new locations

    lon lat
1  -3.00  44
4  -2.75  44
7  -2.50  44
10 -2.25  44
13 -2.00  44
16 -1.75  44

但是在这里我不知道协变量的值(b.depth，c.dist，temp.20m，log.net.area).

but here I don't know the values of the covariates (b.depth, c.dist, temp.20m, log.net.area).

推荐答案

predict仍然要求模型中使用的所有变量都显示在newdata中，但是您可以传入一些任意值，例如0 s ，针对您没有的那些协变量，然后使用type = "terms"和terms = name_of_the_wanted_smooth_term进行操作.使用

predict still requires all variables used in your model to be presented in newdata, but you can pass in some arbitrary values, like 0s, to those covariates you don't have, then use type = "terms" and terms = name_of_the_wanted_smooth_term to proceed. Use

sapply(gm2$smooth, "[[", "label")
#[1] "s(lon,lat)"        "s(I(b.depth^0.5))" "s(c.dist)"        
#[4] "s(temp.20m)"

检查模型中的平滑项.

## new spatial locations to predict
newdat <- read.table(text = "lon lat
                             1  -3.00  44
                             4  -2.75  44
                             7  -2.50  44
                             10 -2.25  44
                             13 -2.00  44
                             16 -1.75  44")

## "garbage" values, just to pass the variable names checking in `predict.gam`
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0

## prediction on the link scale
pred_link <- predict(gm2, newdata = newdat, type = "terms", terms = "s(lon,lat)")
#   s(lon,lat)
#1  -1.9881967
#4  -1.9137971
#7  -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept) 
#   2.553535 

## simplify to vector
pred_link <- attr(pred_link, "constant") + rowSums(pred_link)
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665

## prediction on the response scale
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301

如果我要为特定的平滑项进行预测，通常不使用predict.gam. predict.gam的逻辑是首先对所有术语进行预测，也就是说，与进行type = "terms"相同.然后

I don't normally use predict.gam if I want to do prediction for a specific smooth term. The logic of predict.gam is to do prediction for all terms first, that is, the same as your doing type = "terms". Then

如果type = "link"，则对所有按词项的预测进行rowSums加上截距(可能使用offset)；
如果未指定type = "terms"和"terms"或"exclude"，则按原样返回结果；
如果type = "terms"并且您已指定"terms"和/或"exclude"，则将进行一些后期处理以删除您不想要的字词，只给您想要的字词.

if type = "link", do a rowSums on all term-wise predictions plus an intercept (possibly with offset);
if type = "terms", and "terms" or "exclude" are unspecified, return the result as it is;
if type = "terms" and you have specified "terms" and / or "exclude", some post-process is done to drop terms you don't want and only give you those you want.

因此，即使您只想要一个术语，predict.gam仍将对所有术语进行计算.

So, predict.gam will always do computation for all terms, even if you just want a single term.

知道这背后的效率低下，这就是我要做的事情

Knowing the inefficiency behind this, this is what I will do:

sm <- gm2$smooth[[1]]  ## extract smooth construction info for `s(lon,lat)`
Xp <- PredictMat(sm, newdat)  ## predictor matrix
b <- gm2$coefficients[with(sm, first.para:last.para)]  ## coefficients for this term
pred_link <- c(Xp %*% b) + gm2$coef[[1]]  ## this term + intercept
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301

你看，我们得到相同的结果.

You see, we get the same result.

将根据这些垃圾值进行一些垃圾预测，但是predict.gam最终将其丢弃.

Some garbage prediction will be made at those garbage values, but predict.gam discards them in the end.

就我而言，

对于像mgcv这样的大软件包，代码维护非常困难.如果您希望满足每个用户的需求，则需要对代码进行重大更改.显然，当像您这样的人只希望它预测一定的平滑度时，我在这里描述的predict.gam逻辑将效率很低.从理论上讲，如果是这种情况，则newdata中的变量名检查可以忽略用户不需要的那些术语.但是，这需要对predict.gam进行重大更改，并且可能由于代码更改而引入许多错误.此外，您必须向CRAN提交变更日志，而CRAN可能不愿意看到这种急剧变化.

Code maintenance is, as far as I feel, very difficult for a big package like mgcv. The code needs be changed significantly if you want it to suit every user's need. Obviously the predict.gam logic as I described here will be inefficient when people, like you, just want it to predict a certain smooth. And in theory if this is the case, variable names checking in newdata can ignore those terms not wanted by users. But, that requires significant change of predict.gam, and could potentially introduce many bugs due to code changes. Furthermore, you have to submit a changelog to CRAN, and CRAN may just not be happy to see this drastic change.

西蒙(Simon)曾经分享他的感受:有很多人告诉我，我应该这样写mgcv，但我简直不能.是的，对像他这样的软件包作者/维护者表示同情.

Simon once shared his feelings: there are many people telling me, I should write mgcv as this or as that, but I simply can't. Yeah, give some sympathy to a package author / maintainer like him.

这取决于您是否提供b.depth，c.dist，temp.20m，log.net.area的协变量值.但是由于您没有在新的位置放置它们，因此可以仅假设这些影响为0.

It will depend if you provide covariates values for b.depth, c.dist, temp.20m, log.net.area. But since you don't have them at new locations, the prediction is just to assume these effects to be 0.

您仅在预测空间场/平滑度.在GAM方法中，空间场被建模为均值的一部分，而不是方差-协方差(如克里金法中那样)，因此我认为您在这里使用残差"是不正确的.

You are only predicting the spatial field / smooth. In GAM approach the spatial field is modeled as part of mean, not variance-covariance (as in kriging), so I think your use of "residuals" is not correct here.

正确.您可以尝试使用predict.gam(带或不带terms = "s(lon,lat)")来帮助您摘要输出.看看当您更改传递给其他协变量的垃圾值时，它是如何变化的.

Correct. You can try predict.gam with or without terms = "s(lon,lat)" to help you digest the output. See how it changes when you vary garbage values passed to other covariates.

## a possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0

predict(gm2, newdat, type = "terms")
#   s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1  -1.9881967          -1.05514 0.4739174   -1.466549
#4  -1.9137971          -1.05514 0.4739174   -1.466549
#7  -1.6365945          -1.05514 0.4739174   -1.466549
#10 -1.1247837          -1.05514 0.4739174   -1.466549
#13 -0.7910023          -1.05514 0.4739174   -1.466549
#16 -0.7234683          -1.05514 0.4739174   -1.466549
#attr(,"constant")
#(Intercept) 
#   2.553535 

predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
#   s(lon,lat)
#1  -1.9881967
#4  -1.9137971
#7  -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept) 
#   2.553535

## another possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 1
#   s(lon,lat) s(I(b.depth^0.5))  s(c.dist) s(temp.20m)
#1  -1.9881967        -0.9858522 -0.3749018   -1.269878
#4  -1.9137971        -0.9858522 -0.3749018   -1.269878
#7  -1.6365945        -0.9858522 -0.3749018   -1.269878
#10 -1.1247837        -0.9858522 -0.3749018   -1.269878
#13 -0.7910023        -0.9858522 -0.3749018   -1.269878
#16 -0.7234683        -0.9858522 -0.3749018   -1.269878
#attr(,"constant")
#(Intercept) 
#   2.553535 

predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
#   s(lon,lat)
#1  -1.9881967
#4  -1.9137971
#7  -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept) 
#   2.553535

这篇关于具有"gp"的GAM更顺畅:在新位置进行预测的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！