smooth.spline(): fitted model does not match the user-specified degrees of freedom

Question

This is the code I ran:

fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
num.samples <- 1000
x <- runif(num.samples)
y <- fun(x) + rnorm(num.samples) * 1.5
fit <- smooth.spline(x, y, all.knots=TRUE, df=3)

Despite df=3, when I checked the fitted model, the output was:

Call:
smooth.spline(x = x, y = y, df = 3, all.knots = TRUE)
Smoothing Parameter  spar= 1.499954  lambda= 0.002508571 (26 iterations)
Equivalent Degrees of Freedom (Df): 9.86422

Could someone please help? Thanks!

Recommended answer

Note that since R-3.4.0 (2017-04-21), smooth.spline accepts a direct specification of λ through the newly added argument lambda. It is still converted to the internal parameter spar during estimation, so the following answer is not affected.

The smoothing parameter λ / spar is at the centre of smoothness control

Smoothness is controlled by the smoothing parameter λ. smooth.spline() uses an internal smoothing parameter spar rather than λ:

spar = s0 + 0.0601 * log(λ)

This logarithmic transform is needed so that criteria such as GCV/CV can be minimized without constraints. The user can specify spar to specify λ indirectly. When spar grows linearly, λ grows exponentially, so there is rarely any need for a large spar value.
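The coefficient 0.0601 can be checked numerically. ?smooth.spline documents λ = r * 256^(3*spar - 1), where r depends only on the design, so regressing spar on log(λ) should give a slope of 1 / (3 * log(256)) ≈ 0.0601. The following is a minimal sketch that regenerates the question's data; the grid spar_grid is an arbitrary choice made for illustration.

```r
# Regenerate the data from the question.
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# Fit at a few fixed spar values (arbitrary illustrative grid)
# and record the lambda that smooth.spline() reports for each.
spar_grid <- seq(0.5, 1.5, by = 0.25)
lambda <- sapply(spar_grid, function(s)
  smooth.spline(x, y, all.knots = TRUE, spar = s)$lambda)

# Slope of spar against log(lambda), next to the documented constant.
slope <- unname(coef(lm(spar_grid ~ log(lambda)))[2])
c(fitted = slope, documented = 1 / (3 * log(256)))
```

The intercept of the same regression plays the role of s0 in the formula above; it changes with the data and the knots, while the slope does not.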

The degrees of freedom df are also defined in terms of λ:

df(λ) = tr( X (X'X + λS)^(-1) X' )

where X is the model matrix with the B-spline basis and S is the penalty matrix.
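The trace in this definition is the sum of the diagonal of the smoother matrix, and smooth.spline() returns that diagonal as the lev component (the leverages). As a small sketch, the reported df can be compared with sum(fit$lev); the value spar = 1.2 is an arbitrary choice for illustration.

```r
# Regenerate the data from the question.
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# Fit at one fixed smoothing level (spar = 1.2 is arbitrary).
fit <- smooth.spline(x, y, all.knots = TRUE, spar = 1.2)

# Reported equivalent df next to the trace of the smoother matrix,
# computed as the sum of the leverages.
c(reported = fit$df, trace = sum(fit$lev))
```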

You can check how they relate to each other on your dataset:

spar <- seq(1, 2.5, by = 0.1)
a <- sapply(spar, function (spar_i) unlist(smooth.spline(x, y, all.knots=TRUE, spar = spar_i)[c("df","lambda")]))

Let's sketch df ~ spar, λ ~ spar and log(λ) ~ spar:

par(mfrow = c(1,3))
plot(spar, a[1, ], type = "b", main = "df ~ spar",
     xlab = "spar", ylab = "df")
plot(spar, a[2, ], type = "b", main = "lambda ~ spar",
     xlab = "spar", ylab = "lambda")
plot(spar, log(a[2,]), type = "b", main = "log(lambda) ~ spar",
     xlab = "spar", ylab = "log(lambda)")

Note the explosive growth of λ with spar, the linear relationship between log(λ) and spar, and the relatively smooth relationship between df and spar.

smooth.spline() fitting iterations for spar

If we specify the value of spar manually, as we did in the sapply() call, no fitting iterations are run to select spar; otherwise smooth.spline() needs to iterate through a number of spar values. If we

specify cv = TRUE / FALSE, the fitting iterations aim to minimize the CV / GCV score;
specify df = mydf, the fitting iterations aim to minimize (df(spar) - mydf) ^ 2.
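The second criterion can be sketched by hand. The code below is not smooth.spline()'s internal search; it only evaluates (df(spar) - mydf)^2 on a coarse grid (spar_grid, an arbitrary illustrative choice) that stays inside the default upper bound of 1.5, to show where the smallest value falls.

```r
# Regenerate the data from the question.
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

mydf <- 3                                 # target degrees of freedom
spar_grid <- seq(0.5, 1.5, by = 0.25)     # illustrative grid, capped at 1.5

# df obtained at each fixed spar value.
df_grid <- sapply(spar_grid, function(s)
  smooth.spline(x, y, all.knots = TRUE, spar = s)$df)

# The criterion that df-matching tries to minimize.
objective <- (df_grid - mydf)^2
data.frame(spar = spar_grid, df = df_grid, objective = objective)

# Grid point with the smallest criterion value.
best <- spar_grid[which.min(objective)]
best
```

Because df decreases as spar increases and is still above the target at spar = 1.5, the smallest criterion value on this grid sits at the upper end of the range.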

Minimizing GCV is easy to follow: we don't care about the GCV score itself, only about the spar that attains it. In contrast, when minimizing (df(spar) - mydf)^2, we usually care about the df value at the end of the iterations rather than spar. But bear in mind that this is a minimization problem, so we are never guaranteed that the final df matches our target value mydf.

Why do you set df = 3 but get df = 9.864?

The end of the iterations can mean that a minimum was found, that the search boundary was reached, or that the maximum number of iterations was reached.

We are far from the maximum iteration limit (default 500), yet we have not found the minimum. So we may have reached the boundary.

Don't focus on df; look at spar instead.

smooth.spline(x, y, all.knots=TRUE, df=3)$spar   # 1.4999

According to ?smooth.spline, smooth.spline() by default searches for spar in [-1.5, 1.5]. That is, when you set df = 3, the minimization terminates at the search boundary rather than reaching df = 3.

Have another look at our graph of the relationship between df and spar. From the figure, it looks like we need a spar value near 2 to get df = 3.
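Instead of reading the value off the graph, the required spar can be located numerically. This is a sketch using base R's uniroot() on df(spar) - 3; the helper df_at and the bracket [1.5, 2.5] are assumptions made for illustration, chosen because df is still above 3 at spar = 1.5 and keeps decreasing as spar grows.

```r
# Regenerate the data from the question.
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# Hypothetical helper: equivalent df at a fixed spar value.
df_at <- function(s) smooth.spline(x, y, all.knots = TRUE, spar = s)$df

# Solve df(spar) = 3 on a bracket that extends past the default bound.
root <- uniroot(function(s) df_at(s) - 3, interval = c(1.5, 2.5))$root
c(spar = root, df = df_at(root))
```

The root lies above 1.5, outside the default search range, which is why the upper bound has to be raised.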

Let's use the control.spar argument:

fit <- smooth.spline(x, y, all.knots=TRUE, df=3, control.spar = list(high = 2.5))
# Smoothing Parameter  spar= 1.859066  lambda= 0.9855336 (14 iterations)
# Equivalent Degrees of Freedom (Df): 3.000305

Now you end up with df = 3, which requires spar = 1.86.

A better suggestion: do not use all.knots = TRUE

Look, you have 1000 data points. With all.knots = TRUE you use about 1000 parameters. Wanting to end up with df = 3 means that roughly 997 of those 1000 parameters must be suppressed. Imagine how large a λ, and hence spar, you need!

Try using a penalized regression spline instead. Suppressing 200 parameters down to 3 is definitely much easier:

fit <- smooth.spline(x, y, nknots = 200, df=3)  ## using 200 knots
# Smoothing Parameter  spar= 1.317883  lambda= 0.9853648 (16 iterations)
# Equivalent Degrees of Freedom (Df): 3.000386

Now you end up with df = 3 without any spar control.
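To see how the knot count affects the spar needed for the same target, the fit can be repeated for a few values of nknots. This is a sketch only: the knot counts 50 and 100 are arbitrary additions for illustration, and only the 200-knot case comes from the answer above.

```r
# Regenerate the data from the question.
set.seed(1)
x <- runif(1000)
y <- 1 + 3 * sin(4 * pi * x - pi) + rnorm(1000) * 1.5

# Request df = 3 with the default spar search range for several knot counts.
# 50 and 100 are illustrative; 200 is the case shown in the answer.
res <- t(sapply(c(50, 100, 200), function(k) {
  fit <- smooth.spline(x, y, nknots = k, df = 3)
  c(nknots = k, spar = fit$spar, df = fit$df)
}))
res
```

Each row shows the spar at which the search stopped and the df it achieved, so it is easy to check whether the target was reached inside the default range.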
