r - smooth.spline(): fitted model does not match user-specified degree of freedom

This is the code I ran:

fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
num.samples <- 1000
x <- runif(num.samples)
y <- fun(x) + rnorm(num.samples) * 1.5
fit <- smooth.spline(x, y, all.knots=TRUE, df=3)

Despite using df=3, when I inspect the fitted model the output is:

Call:
smooth.spline(x = x, y = y, df = 3, all.knots = TRUE)
Smoothing Parameter  spar= 1.499954  lambda= 0.002508571 (26 iterations)
Equivalent Degrees of Freedom (Df): 9.86422

Can anyone help? Thanks!

Best answer

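Note that since R-3.4.0 (2017-04-21), smooth.spline can accept a direct specification of λ through the newly added argument lambda. It is still converted to the internal spar during estimation, so the answer below is unaffected.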

Smoothing parameter λ / spar lies at the centre of smoothness control

Smoothness is controlled by the smoothing parameter λ. smooth.spline() uses an internal smoothing parameter spar rather than λ:

spar = s0 + 0.0601 * log(λ)

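Such a logarithmic transform is necessary in order to do unconstrained minimization, for example of GCV/CV. The user can specify spar to specify λ indirectly. When spar grows linearly, λ grows exponentially, so there is rarely any need for a large spar value.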
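The constant 0.0601 can be checked numerically. According to ?smooth.spline, λ = r * 256^(3*spar - 1), where r is a data-dependent ratio, which implies a slope of 1/(3*log(256)) for spar against log(λ). A minimal sketch, reusing the simulated data from the question (the names fit1, fit2 and slope are illustrative only):

```r
# Simulated data from the question
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Two fits with a fixed spar, so no search over spar takes place
fit1 <- smooth.spline(x, y, all.knots = TRUE, spar = 1.0)
fit2 <- smooth.spline(x, y, all.knots = TRUE, spar = 1.5)

# Empirical slope of spar against log(lambda), next to the documented constant
slope <- (fit2$spar - fit1$spar) / (log(fit2$lambda) - log(fit1$lambda))
print(c(empirical = slope, documented = 1 / (3 * log(256))))
```

The intercept s0 depends on the data through r, which is why only the slope is compared here.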

The degree of freedom df is also defined in terms of λ:

df(λ) = tr( X (X'X + λS)^(-1) X' )

where X is the model matrix with B-spline basis and S is the penalty matrix.
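Since df is the trace of the smoother ("hat") matrix, it should equal the sum of the leverages, which smooth.spline() returns in the lev component. A small sketch to confirm this on the question's data (the choice spar = 1 is arbitrary and only for illustration):

```r
# Simulated data from the question
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Fit with an arbitrary fixed spar
fit <- smooth.spline(x, y, all.knots = TRUE, spar = 1)

# lev holds the diagonal of the smoother matrix; its sum is the trace
print(c(sum_of_leverages = sum(fit$lev), reported_df = fit$df))
```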

You can check how they relate using your dataset:

spar <- seq(1, 2.5, by = 0.1)
a <- sapply(spar, function (spar_i) unlist(smooth.spline(x, y, all.knots=TRUE, spar = spar_i)[c("df","lambda")]))

Let's sketch df ~ spar, λ ~ spar and log(λ) ~ spar:

par(mfrow = c(1,3))
plot(spar, a[1, ], type = "b", main = "df ~ spar",
     xlab = "spar", ylab = "df")
plot(spar, a[2, ], type = "b", main = "lambda ~ spar",
     xlab = "spar", ylab = "lambda")
plot(spar, log(a[2,]), type = "b", main = "log(lambda) ~ spar",
     xlab = "spar", ylab = "log(lambda)")

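Note the radical growth of λ with spar, the linear relationship between log(λ) and spar, and the relatively smooth relationship between df and spar.

Fitting iterations of smooth.spline() for spar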

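If we manually specify a value of spar, as we did in the sapply() call, no fitting iterations are needed to select spar; otherwise smooth.spline() has to iterate through a number of spar values. If we
specify cv = TRUE / FALSE, the fitting iterations aim to minimize the CV / GCV score;
specify df = mydf, the fitting iterations aim to minimize (df(spar) - mydf)^2.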

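Minimizing GCV is easy to follow: we don't care about the GCV score itself, only the corresponding spar. In contrast, when minimizing (df(spar) - mydf)^2, we usually care about the df value at the end of the iterations rather than spar! But bear in mind that this is a minimization problem, so we are never guaranteed that the final df matches our target value mydf.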
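To make the df-matching objective concrete, here is a rough imitation using optimize() over the default search interval for spar. This is only a sketch of the idea and not the internal routine of smooth.spline(); the helper df_at_spar and the names target and res are hypothetical:

```r
# Simulated data from the question
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Hypothetical helper: equivalent degrees of freedom for a given spar
df_at_spar <- function(s) smooth.spline(x, y, all.knots = TRUE, spar = s)$df

# Minimize (df(spar) - mydf)^2 over the default search interval [-1.5, 1.5]
target <- 3
res <- optimize(function(s) (df_at_spar(s) - target)^2, interval = c(-1.5, 1.5))

# Inspect where the search stopped and the df reached there
print(c(spar = res$minimum, df = df_at_spar(res$minimum)))
```

If the reported spar sits right at the upper end of the interval, the search was stopped by the boundary rather than by reaching the target df.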

Why do you put df = 3 but get df = 9.864?

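The end of the iterations can mean that a minimum was hit, that the search boundary was reached, or that the maximum number of iterations was reached.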

We are far from the maximum iteration limit (default 500), yet we did not hit the minimum. So we have probably reached the boundary.

Do not focus on df; look at spar instead.
smooth.spline(x, y, all.knots=TRUE, df=3)$spar # 1.4999

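According to ?smooth.spline, by default smooth.spline() searches for spar within [-1.5, 1.5]. That is, when you put df = 3, the minimization terminates at the search boundary rather than hitting df = 3.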

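Have another look at our plot of the relationship between df and spar. From the figure, it looks like we need a spar value somewhere near 2 to obtain df = 3.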
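Instead of reading the value off the plot, the spar that gives df = 3 can be located by root finding on df(spar) - 3. A minimal sketch, assuming the bracket [1.5, 2.5] contains the root, as the spar grid used earlier suggests (the helper df_at_spar and the name root are hypothetical):

```r
# Simulated data from the question
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Hypothetical helper: equivalent degrees of freedom for a given spar
df_at_spar <- function(s) smooth.spline(x, y, all.knots = TRUE, spar = s)$df

# Solve df(spar) = 3 beyond the default upper search limit of 1.5
root <- uniroot(function(s) df_at_spar(s) - 3, interval = c(1.5, 2.5), tol = 1e-6)
print(root$root)
```

The root lies above 1.5, which is exactly why the default search interval cannot reach df = 3 for this fit.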

Let's use the control.spar argument:
fit <- smooth.spline(x, y, all.knots=TRUE, df=3, control.spar = list(high = 2.5))
# Smoothing Parameter  spar= 1.859066  lambda= 0.9855336 (14 iterations)
# Equivalent Degrees of Freedom (Df): 3.000305

Now you see that you end up with df = 3, and that this requires spar = 1.86.

A better suggestion: do not use all.knots = TRUE

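Look, you have 1000 data points. With all.knots = TRUE you will use 1000 parameters. Wanting to end up with df = 3 means that 997 out of the 1000 parameters have to be suppressed. Imagine how large a λ, and hence spar, you need!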

Try using a penalized regression spline instead. Suppressing 200 parameters down to 3 is definitely much easier:
fit <- smooth.spline(x, y, nknots = 200, df=3)  ## using 200 knots
# Smoothing Parameter  spar= 1.317883  lambda= 0.9853648 (16 iterations)
# Equivalent Degrees of Freedom (Df): 3.000386

Now you end up with df = 3 without any need for control.spar.
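The difference in model size can be inspected directly by counting the B-spline coefficients of each fit. A short sketch on the question's data; spar = 1 is an arbitrary fixed value chosen only so that no search over spar is run, and the names fit_all, fit_200, n_coef_all and n_coef_200 are illustrative:

```r
# Simulated data from the question
fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Same fixed spar in both fits; only the knot placement differs
fit_all <- smooth.spline(x, y, all.knots = TRUE, spar = 1)
fit_200 <- smooth.spline(x, y, nknots = 200, spar = 1)

# Number of B-spline coefficients that the penalty has to shrink
n_coef_all <- length(fit_all$fit$coef)
n_coef_200 <- length(fit_200$fit$coef)
print(c(all_knots = n_coef_all, nknots_200 = n_coef_200))
```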

Regarding r - smooth.spline(): fitted model does not match user-specified degree of freedom, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36779660/