问题描述
我正在使用以下地理叠加模型
I am using the following geoadditive model
library(gamair)
library(mgcv)
data(mack)
mack$log.net.area <- log(mack$net.area)
gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
s(I(b.depth^.5)) +
s(c.dist) +
s(temp.20m) +
offset(log.net.area),
data = mack, family = tw, method = "REML")
我如何使用它来预测在没有协变量数据的新位置(lon/lat)
中的egg.count
的值,例如在kriging
中?
How can I use it to predict the value of egg.count
at new locations (lon/lat)
where I don't have covariate data, as in kriging
?
例如说我想在这些新位置预测egg.count
For example say I want to predict egg.count
at these new locations
lon lat
1 -3.00 44
4 -2.75 44
7 -2.50 44
10 -2.25 44
13 -2.00 44
16 -1.75 44
但是在这里我不知道协变量的值(b.depth
,c.dist
,temp.20m
,log.net.area
).
but here I don't know the values of the covariates (b.depth
, c.dist
, temp.20m
, log.net.area
).
推荐答案
predict
仍然要求模型中使用的所有变量都显示在newdata
中,但是您可以传入一些任意值,例如0
s ,针对您没有的那些协变量,然后使用type = "terms"
和terms = name_of_the_wanted_smooth_term
进行操作.使用
predict
still requires all variables used in your model to be presented in newdata
, but you can pass in some arbitrary values, like 0
s, to those covariates you don't have, then use type = "terms"
and terms = name_of_the_wanted_smooth_term
to proceed. Use
sapply(gm2$smooth, "[[", "label")
#[1] "s(lon,lat)" "s(I(b.depth^0.5))" "s(c.dist)"
#[4] "s(temp.20m)"
检查模型中的平滑项.
## new spatial locations to predict
newdat <- read.table(text = "lon lat
1 -3.00 44
4 -2.75 44
7 -2.50 44
10 -2.25 44
13 -2.00 44
16 -1.75 44")
## "garbage" values, just to pass the variable names checking in `predict.gam`
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0
## prediction on the link scale
pred_link <- predict(gm2, newdata = newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
## simplify to vector
pred_link <- attr(pred_link, "constant") + rowSums(pred_link)
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
## prediction on the response scale
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
如果我要为特定的平滑项进行预测,通常不使用predict.gam
. predict.gam
的逻辑是首先对所有术语进行预测,也就是说,与进行type = "terms"
相同.然后
I don't normally use predict.gam
if I want to do prediction for a specific smooth term. The logic of predict.gam
is to do prediction for all terms first, that is, the same as your doing type = "terms"
. Then
- 如果
type = "link"
,则对所有按词项的预测进行rowSums
加上截距(可能使用offset
); - 如果未指定
type = "terms"
和"terms"
或"exclude"
,则按原样返回结果; - 如果
type = "terms"
并且您已指定"terms"
和/或"exclude"
,则将进行一些后期处理以删除您不想要的字词,只给您想要的字词.
- if
type = "link"
, do arowSums
on all term-wise predictions plus an intercept (possibly withoffset
); - if
type = "terms"
, and"terms"
or"exclude"
are unspecified, return the result as it is; - if
type = "terms"
and you have specified"terms"
and / or"exclude"
, some post-process is done to drop terms you don't want and only give you those you want.
因此,即使您只想要一个术语,predict.gam
仍将对所有术语进行计算.
So, predict.gam
will always do computation for all terms, even if you just want a single term.
知道这背后的效率低下,这就是我要做的事情
Knowing the inefficiency behind this, this is what I will do:
sm <- gm2$smooth[[1]] ## extract smooth construction info for `s(lon,lat)`
Xp <- PredictMat(sm, newdat) ## predictor matrix
b <- gm2$coefficients[with(sm, first.para:last.para)] ## coefficients for this term
pred_link <- c(Xp %*% b) + gm2$coef[[1]] ## this term + intercept
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
你看,我们得到相同的结果.
You see, we get the same result.
将根据这些垃圾值进行一些垃圾预测,但是predict.gam
最终将其丢弃.
Some garbage prediction will be made at those garbage values, but predict.gam
discards them in the end.
就我而言,
对于像mgcv
这样的大软件包,代码维护非常困难.如果您希望满足每个用户的需求,则需要对代码进行重大更改.显然,当像您这样的人只希望它预测一定的平滑度时,我在这里描述的predict.gam
逻辑将效率很低.从理论上讲,如果是这种情况,则newdata
中的变量名检查可以忽略用户不需要的那些术语.但是,这需要对predict.gam
进行重大更改,并且可能由于代码更改而引入许多错误.此外,您必须向CRAN提交变更日志,而CRAN可能不愿意看到这种急剧变化.
Code maintenance is, as far as I feel, very difficult for a big package like mgcv
. The code needs be changed significantly if you want it to suit every user's need. Obviously the predict.gam
logic as I described here will be inefficient when people, like you, just want it to predict a certain smooth. And in theory if this is the case, variable names checking in newdata
can ignore those terms not wanted by users. But, that requires significant change of predict.gam
, and could potentially introduce many bugs due to code changes. Furthermore, you have to submit a changelog to CRAN, and CRAN may just not be happy to see this drastic change.
西蒙(Simon)曾经分享他的感受:有很多人告诉我,我应该这样写mgcv
,但我简直不能.是的,对像他这样的软件包作者/维护者表示同情.
Simon once shared his feelings: there are many people telling me, I should write mgcv
as this or as that, but I simply can't. Yeah, give some sympathy to a package author / maintainer like him.
这取决于您是否提供b.depth
,c.dist
,temp.20m
,log.net.area
的协变量值.但是由于您没有在新的位置放置它们,因此可以仅假设这些影响为0
.
It will depend if you provide covariates values for b.depth
, c.dist
, temp.20m
, log.net.area
. But since you don't have them at new locations, the prediction is just to assume these effects to be 0
.
您仅在预测空间场/平滑度.在GAM方法中,空间场被建模为均值的一部分,而不是方差-协方差(如克里金法中那样),因此我认为您在这里使用残差"是不正确的.
You are only predicting the spatial field / smooth. In GAM approach the spatial field is modeled as part of mean, not variance-covariance (as in kriging), so I think your use of "residuals" is not correct here.
正确.您可以尝试使用predict.gam
(带或不带terms = "s(lon,lat)"
)来帮助您摘要输出.看看当您更改传递给其他协变量的垃圾值时,它是如何变化的.
Correct. You can try predict.gam
with or without terms = "s(lon,lat)"
to help you digest the output. See how it changes when you vary garbage values passed to other covariates.
## a possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0
predict(gm2, newdat, type = "terms")
# s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1 -1.9881967 -1.05514 0.4739174 -1.466549
#4 -1.9137971 -1.05514 0.4739174 -1.466549
#7 -1.6365945 -1.05514 0.4739174 -1.466549
#10 -1.1247837 -1.05514 0.4739174 -1.466549
#13 -0.7910023 -1.05514 0.4739174 -1.466549
#16 -0.7234683 -1.05514 0.4739174 -1.466549
#attr(,"constant")
#(Intercept)
# 2.553535
predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
## another possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 1
# s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1 -1.9881967 -0.9858522 -0.3749018 -1.269878
#4 -1.9137971 -0.9858522 -0.3749018 -1.269878
#7 -1.6365945 -0.9858522 -0.3749018 -1.269878
#10 -1.1247837 -0.9858522 -0.3749018 -1.269878
#13 -0.7910023 -0.9858522 -0.3749018 -1.269878
#16 -0.7234683 -0.9858522 -0.3749018 -1.269878
#attr(,"constant")
#(Intercept)
# 2.553535
predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
这篇关于具有"gp"的GAM更顺畅:在新位置进行预测的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!