Problem description
This is the code I ran:
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
num.samples <- 1000
x <- runif(num.samples)
y <- fun(x) + rnorm(num.samples) * 1.5
fit <- smooth.spline(x, y, all.knots=TRUE, df=3)
Despite df=3, when I checked the fitted model, the output was:
Call:
smooth.spline(x = x, y = y, df = 3, all.knots = TRUE)
Smoothing Parameter spar= 1.499954 lambda= 0.002508571 (26 iterations)
Equivalent Degrees of Freedom (Df): 9.86422
Could someone please help? Thanks!
Recommended answer
Note that from R-3.4.0 (2017-04-21), smooth.spline can accept direct specification of λ through a newly added argument lambda. But it will still be converted to the internal spar during estimation, so the following answer is not affected.
Smoothing parameter λ / spar lies at the centre of smoothness control
Smoothness is controlled by the smoothing parameter λ. smooth.spline() uses an internal smoothing parameter spar rather than λ:
spar = s0 + 0.0601 * log(λ)
Such a logarithmic transform is necessary for unconstrained minimization, such as GCV/CV. The user can specify spar to indirectly specify λ. When spar grows linearly, λ grows exponentially, so there is rarely a need for a large spar value.
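As a quick check of the spar / λ relationship, here is a minimal sketch using the simulated data from the question. It relies on the formula documented in ?smooth.spline, λ = r * 256^(3 * spar - 1), where r depends only on the data; the variable names fit_a, fit_b and slope are purely illustrative.

```r
# Same simulated data as in the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Fit with two manually chosen spar values; no search over spar is performed.
fit_a <- smooth.spline(x, y, all.knots = TRUE, spar = 0.5)
fit_b <- smooth.spline(x, y, all.knots = TRUE, spar = 1.0)

# Assumption (from ?smooth.spline): lambda = r * 256^(3 * spar - 1),
# so log(lambda) should be linear in spar with slope 3 * log(256).
slope <- (log(fit_b$lambda) - log(fit_a$lambda)) / (fit_b$spar - fit_a$spar)
print(slope)
print(3 * log(256))
print(1 / slope)  # coefficient of log(lambda) in the spar formula
```

The reciprocal of the slope should be close to the 0.0601 coefficient in the formula above.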
The degrees of freedom df is also defined in terms of λ:
df(λ) = tr( X (X'X + λS)^(-1) X' )
where X is the model matrix with B-spline basis and S is the penalty matrix.
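To connect the degrees of freedom with the trace of the smoother matrix, the sketch below sums the leverages returned by smooth.spline() (the lev component, documented as the diagonal of the smoother matrix) and compares the sum with the reported df. The choice spar = 1 is arbitrary and only for illustration.

```r
# Same simulated data as in the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Fix spar manually so that no search is involved (spar = 1 is arbitrary).
fit <- smooth.spline(x, y, all.knots = TRUE, spar = 1)

# Leverages are the diagonal entries of the smoother ("hat") matrix,
# so their sum is the trace of that matrix.
trace_hat <- sum(fit$lev)
print(c(trace = trace_hat, df = fit$df))
```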
You can check their relationships with your dataset:
spar <- seq(1, 2.5, by = 0.1)
a <- sapply(spar, function (spar_i) unlist(smooth.spline(x, y, all.knots=TRUE, spar = spar_i)[c("df","lambda")]))
Let's sketch df ~ spar, λ ~ spar and log(λ) ~ spar:
par(mfrow = c(1,3))
plot(spar, a[1, ], type = "b", main = "df ~ spar",
xlab = "spar", ylab = "df")
plot(spar, a[2, ], type = "b", main = "lambda ~ spar",
xlab = "spar", ylab = "lambda")
plot(spar, log(a[2,]), type = "b", main = "log(lambda) ~ spar",
xlab = "spar", ylab = "log(lambda)")
Note the radical growth of λ with spar, the linear relationship between log(λ) and spar, and the relatively smooth relationship between df and spar.
smooth.spline() fitting iterations for spar
If we manually specify the value of spar, as we did in the sapply() call, no fitting iterations are done to select spar; otherwise smooth.spline() needs to iterate through a number of spar values. If we
- specify cv = TRUE / FALSE, the fitting iterations aim to minimize the CV / GCV score;
- specify df = mydf, the fitting iterations aim to minimize (df(spar) - mydf) ^ 2.
Minimizing GCV is easy to follow. We don't care about the GCV score itself, but about the corresponding spar. In contrast, when minimizing (df(spar) - mydf)^2, we usually care about the df value at the end of the iterations rather than spar! But bear in mind that this is a minimization problem, so we are never guaranteed that the final df matches our target value mydf.
Why did you put df = 3, but get df = 9.864?
The end of the iterations can mean hitting a minimum, reaching the search boundary, or reaching the maximum number of iterations.
We are far from the maximum iteration limit (default 500; only 26 iterations were used), yet we did not hit the minimum. So we may have reached the boundary.
Don't focus on df; look at spar instead.
smooth.spline(x, y, all.knots=TRUE, df=3)$spar # 1.4999
According to ?smooth.spline, by default smooth.spline() searches spar in the interval [-1.5, 1.5]. That is, when you put df = 3, the minimization terminates at the search boundary rather than reaching df = 3.
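A small sketch of how one might detect this situation programmatically: compare the returned spar with the upper search bound. The value 1.5 is the default upper bound stated above; the tolerance 1e-3 and the name hit_boundary are arbitrary choices for illustration.

```r
# Same simulated data as in the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

fit <- smooth.spline(x, y, all.knots = TRUE, df = 3)

upper <- 1.5  # default upper search bound for spar (see ?smooth.spline)
hit_boundary <- abs(fit$spar - upper) < 1e-3  # tolerance is an arbitrary choice

print(fit$spar)
print(fit$df)
print(hit_boundary)
```

If hit_boundary is TRUE while fit$df is still far from the requested value, the search interval for spar is too narrow for that target.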
Have another look at our graph of the relationship between df and spar. From the figure, it looks like we need a spar value near 2 in order to get df = 3.
Let's use the control.spar argument:
fit <- smooth.spline(x, y, all.knots=TRUE, df=3, control.spar = list(high = 2.5))
# Smoothing Parameter spar= 1.859066 lambda= 0.9855336 (14 iterations)
# Equivalent Degrees of Freedom (Df): 3.000305
Now you end up with df = 3, and we need spar = 1.86 to get there.
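As a cross-check, the spar value that gives df = 3 can also be found by root-finding on df(spar) - 3 over a wider interval. This is only a sketch using base R's uniroot(), not what smooth.spline() does internally; the helper name df_gap and the bracket [1, 2.5] (taken from the spar grid plotted earlier) are illustrative choices.

```r
# Same simulated data as in the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Hypothetical helper: difference between the df obtained at a given spar
# and the target df.
df_gap <- function(s, target = 3) {
  smooth.spline(x, y, all.knots = TRUE, spar = s)$df - target
}

# df decreases as spar grows, so the gap changes sign inside [1, 2.5].
sol <- uniroot(df_gap, interval = c(1, 2.5))
print(sol$root)
```

The root should land close to the spar reported by the control.spar fit.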
A better suggestion: do not use all.knots = TRUE
Look, you have 1000 data points. With all.knots = TRUE you will use roughly 1000 parameters. Wishing to end up with df = 3 implies that 997 out of 1000 parameters are suppressed. Imagine how large a λ, and hence spar, you need!
Try using a penalized regression spline instead. Suppressing 200 parameters to 3 is definitely much easier:
fit <- smooth.spline(x, y, nknots = 200, df=3) ## using 200 knots
# Smoothing Parameter spar= 1.317883 lambda= 0.9853648 (16 iterations)
# Equivalent Degrees of Freedom (Df): 3.000386
Now you end up with df = 3 without any spar control.
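To make the difference in model size concrete, the sketch below compares the number of B-spline coefficients in the two setups by reading the coef vector stored in the fit component of the returned object. The names n_all and n_200 are illustrative, and spar = 1 in the first fit is an arbitrary value used only to obtain the basis size.

```r
# Same simulated data as in the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# All unique x values as knots (spar fixed arbitrarily; only the size matters here).
fit_all <- smooth.spline(x, y, all.knots = TRUE, spar = 1)
# Reduced basis with 200 knots, targeting df = 3 with default search bounds.
fit_200 <- smooth.spline(x, y, nknots = 200, df = 3)

n_all <- length(fit_all$fit$coef)  # number of B-spline coefficients
n_200 <- length(fit_200$fit$coef)
print(c(all_knots = n_all, nknots_200 = n_200))
print(fit_200$df)
```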