R smooth.spline(): smoothing spline is not smooth but overfits my data


Question

I have several data points that look suitable for fitting a spline through them. When I do this, I get a rather bumpy fit that looks like overfitting, which is not what I understand by smoothing.

Is there a special option or parameter that gives back a really smooth spline function, like here?

Setting the penalty parameter of smooth.spline had no visible effect. Maybe I used it wrongly?

Here are the data and code:

results <- structure(
    list(
        beta = c(
            0.983790622281964, 0.645152464354322,
            0.924104713597375, 0.657703886566088, 0.788138034115623, 0.801080207252363,
            1, 0.858337365965949, 0.999687052533693, 0.666552625121279, 0.717453633245958,
            0.621570152961453, 0.964658181346544, 0.65071758770312, 0.788971505000918,
            0.980476054183113, 0.670263506919246, 0.600387040967624, 0.759173403408052,
            1, 0.986409675965, 0.982996471134736, 1, 0.995340781899163, 0.999855895958986,
            1, 0.846179233381267, 0.879226324448832, 0.795820998892035, 0.997586607285667,
            0.848036806290156, 0.905320944437968, 0.947709125535428, 0.592172373022407,
            0.826847031044922, 0.996916006944244, 0.785967729206612, 0.650346929853076,
            0.84206351833549, 0.999043126652724, 0.936879214753098, 0.76674066557003,
            0.591431233516217, 1, 0.999833445117791, 0.999606223666537, 0.6224971799303,
            1, 0.974537160571494, 0.966717133936379
        ), inventoryCost = c(
            1750702.95138889,
            442784.114583333, 1114717.44791667, 472669.357638889, 716895.920138889,
            735396.180555556, 3837320.74652778, 872873.4375, 2872414.93055556,
            481095.138888889, 538125.520833333, 392199.045138889, 1469500.95486111,
            459873.784722222, 656220.486111111, 1654143.83680556, 437511.458333333,
            393295.659722222, 630952.170138889, 4920958.85416667, 1723517.10069444,
            1633579.86111111, 4639909.89583333, 2167748.35069444, 3062420.65972222,
            5132702.34375, 838441.145833333, 937659.288194444, 697767.1875,
            2523016.31944444, 800903.819444444, 1054991.49305556, 1266970.92013889,
            369537.673611111, 764995.399305556, 2322879.6875, 656021.701388889,
            458403.038194444, 844133.420138889, 2430700, 1232256.68402778,
            695574.479166667, 351348.524305556, 3827440.71180556, 3687610.41666667,
            2950652.51736111, 404550.78125, 4749901.64930556, 1510481.59722222,
            1422708.07291667
        )
    ), .Names = c("beta", "inventoryCost"), row.names = c(NA, -50L), class = "data.frame"
)

plot(results$beta, results$inventoryCost)
mySpline <- smooth.spline(results$beta, results$inventoryCost, penalty = 999999)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)

Recommended answer

Transform your data sensibly before modelling

Given the scale of your results$inventoryCost, a log transform is appropriate. For simplicity I use x and y in what follows. I also reorder your data so that x is ascending:

x <- results$beta; y <- log(results$inventoryCost)
reorder <- order(x); x <- x[reorder]; y <- y[reorder]

par(mfrow = c(1,2))
plot(x, y, main = "take log transform")
hist(x, main = "x is skewed")

Doesn't the left figure look better? It is also highly recommended to transform x further, because it is skewed (see the right figure).

The following transformation is appropriate:

x1 <- -(1-x)^(1/3)

The cube root of (1-x) spreads the data out more around x = 1. I negate the result so that x1 increases with x instead of decreasing. Now let's check the relationship:

par(mfrow = c(1,2))
plot(x1, y, main = expression(y %~% ~ x1))
hist(x1, main = "x1 is well spread out")
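
To see numerically what this transform does, here is a minimal sketch on a few hand-picked x values (illustrative values, not the question's data). It checks that x1 increases with x and compares how much of the total range the interval from 0.98 to 1 takes up before and after the transform.

```r
# Illustrative x values (an assumption, not the question's data)
x <- c(0.60, 0.80, 0.90, 0.98, 0.999, 1)

# Negated cube root of (1 - x): stretches the region near x = 1
x1 <- -(1 - x)^(1/3)

# x1 should increase with x so that the ordering of the data is kept
is_monotone <- all(diff(x1) > 0)

# Share of the total range taken up by [0.98, 1], before and after;
# position 4 holds x = 0.98 and position 6 holds x = 1
share_x  <- (x[6] - x[4]) / diff(range(x))
share_x1 <- (x1[6] - x1[4]) / diff(range(x1))

print(data.frame(x = x, x1 = round(x1, 3)))
cat("monotone:", is_monotone, "\n")
cat("share of range in [0.98, 1]: x =", round(share_x, 3),
    " x1 =", round(share_x1, 3), "\n")
```

A larger share for x1 means the crowded points near x = 1 get more room, which is what the histogram of x1 shows.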

Fitting a spline

Now we are ready for statistical modelling. Try the following call:

fit <- smooth.spline(x1, y, nknots = 10)
pred <- predict(fit, x1)$y  ## predict at all x1
plot(x1, y)  ## scatter plot
lines(x1, pred, lwd = 2, col = 2)  ## fitted spline

Does it look nice? Note that nknots = 10 tells smooth.spline to use 10 knots, taken at evenly spaced positions among the sorted unique x values (so roughly by quantile). We are therefore fitting a penalized regression spline rather than a true smoothing spline. With the default all.knots = FALSE, smooth.spline() places a knot at every unique x value only when there are fewer than 50 of them; for larger data sets it thins the knots, so you need all.knots = TRUE to get a true smoothing spline (see the later example).
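
The effect of nknots and all.knots on the size of the spline basis can be checked directly. The sketch below uses simulated data on a regular grid (an assumption for illustration, not the question's data) and reads the number of B-spline coefficients from the fit$nk component of the returned object.

```r
set.seed(42)  # simulated data, for illustration only
small_x <- seq(0, 1, length.out = 45)
small_y <- sin(2 * pi * small_x) + rnorm(45, sd = 0.2)
large_x <- seq(0, 1, length.out = 200)
large_y <- sin(2 * pi * large_x) + rnorm(200, sd = 0.2)

fit_small <- smooth.spline(small_x, small_y)                    # default, fewer than 50 unique x
fit_large <- smooth.spline(large_x, large_y)                    # default, 200 unique x
fit_10    <- smooth.spline(large_x, large_y, nknots = 10)       # reduced knot set
fit_all   <- smooth.spline(large_x, large_y, all.knots = TRUE)  # one knot per unique x

# Number of B-spline coefficients in each fit
nk <- c(small_default = fit_small$fit$nk,
        large_default = fit_large$fit$nk,
        nknots_10     = fit_10$fit$nk,
        large_all     = fit_all$fit$nk)
print(nk)
```

The basis for nknots = 10 should be much smaller than the default one for 200 points, which in turn should be smaller than the all.knots = TRUE basis.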

I also dropped penalty = 999999, because that argument is not the way to control smoothness: it is the coefficient applied to the degrees of freedom in the GCV criterion. If you want to set the smoothness yourself, rather than letting smooth.spline choose it by GCV, use the df or spar argument. I will give an example later.
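
As a minimal sketch of the df and spar arguments, the following fits the same simulated data (an assumption for illustration, not the question's data) three ways: with GCV choosing the smoothness, with a target of 4 degrees of freedom, and with a fixed spar of 1.

```r
set.seed(1)  # simulated data, for illustration only
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_gcv  <- smooth.spline(x, y, nknots = 10)            # smoothness chosen by GCV
fit_df   <- smooth.spline(x, y, nknots = 10, df = 4)    # target equivalent degrees of freedom
fit_spar <- smooth.spline(x, y, nknots = 10, spar = 1)  # fixed smoothing parameter

# Equivalent degrees of freedom each fit ends up with
df_used <- c(gcv = fit_gcv$df, df_4 = fit_df$df, spar_1 = fit_spar$df)
print(round(df_used, 2))
```

A smaller df or a larger spar gives a smoother curve; the specific values 4 and 1 are arbitrary choices for the illustration.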

To transform the fit back to the original scale, do:

plot(x, exp(y), main = expression(Inventory %~%~ beta))
lines(x, exp(pred), lwd = 2, col = 2)

As you can see, the fitted spline is as smooth as you expected.

Explanation of the fitted spline

Let's look at the summary of the fitted spline:

> fit

Smoothing Parameter  spar= 0.4549062  lambda= 0.0008657722 (11 iterations)
Equivalent Degrees of Freedom (Df): 6.022959
Penalized Criterion: 0.08517417
GCV: 0.004288539

We used 10 knots and ended up with about 6 degrees of freedom, so penalization suppresses about 4 parameters. The smoothing parameter chosen by GCV, after 11 iterations, is lambda = 0.0008657722.
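
The quantities in the printed summary are also stored in the fitted object, so they can be compared programmatically. This is a small sketch on simulated data (an assumption for illustration, so the numbers will differ from the summary above).

```r
set.seed(2)  # simulated data, for illustration only
x <- seq(0, 1, length.out = 100)
y <- exp(2 * x) + rnorm(100, sd = 0.2)

fit <- smooth.spline(x, y, nknots = 10)

# Components behind the printed summary
summary_values <- c(spar     = fit$spar,      # smoothing parameter on the spar scale
                    lambda   = fit$lambda,    # the same parameter on the lambda scale
                    df       = fit$df,        # equivalent degrees of freedom
                    pen.crit = fit$pen.crit,  # penalized criterion
                    gcv      = fit$cv.crit)   # GCV score
print(summary_values)

# Size of the unpenalized basis versus the effective degrees of freedom
n_coef <- fit$fit$nk
cat("coefficients:", n_coef, " equivalent df:", round(fit$df, 2), "\n")
```

The gap between the number of coefficients and the equivalent degrees of freedom is what the penalty has suppressed.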

Why do we have to transform x to x1?

A spline is penalized through its second derivative, but the penalty is the squared second derivative integrated over the whole range of the data, not a local quantity. Now look at your data (x, y). For x below 0.98 the relationship is relatively steady; as x approaches 1 it quickly becomes steeper. The "change point" at 0.98 has a very high second derivative, much higher than at any other location.

y0 <- as.numeric(tapply(y, x, mean))  ## average y over tied x values
x0 <- unique(x)  ## distinct x values (x is already sorted)
dy0 <- diff(y0) / diff(x0)  ## 1st order difference
ddy0 <- diff(dy0) / diff(x0[-1])  ## 2nd order difference
plot(x0[seq_along(ddy0)], abs(ddy0), pch = 19)

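Look at the huge spike in the second-order difference (the discrete second derivative)! If we fit a spline directly on x, the fit around this change point will be heavily penalized.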

bad <- smooth.spline(x, y, all.knots = TRUE)
bad.pred <- predict(bad, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(bad.pred), col = 2, lwd = 3)
abline(v = 0.98, lwd = 2, lty = 2)

You can clearly see that the spline has difficulty approximating the data after x = 0.98.
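
One way to see where the roughness penalty is spent is to evaluate the second derivative of a fitted spline with predict(..., deriv = 2) and integrate its square numerically. The sketch below does this on a simulated curve that is steep near x = 1 (a hypothetical stand-in for the question's data), using a simple Riemann sum.

```r
set.seed(3)  # simulated data, for illustration only
x <- seq(0.6, 1, length.out = 200)
y <- -log(1.002 - x) + rnorm(200, sd = 0.05)  # hypothetical curve, steep near x = 1

fit <- smooth.spline(x, y)

# Second derivative of the fitted spline on a fine grid
grid <- seq(min(x), max(x), length.out = 400)
d2 <- predict(fit, grid, deriv = 2)$y

# Riemann-sum approximation of the integrated squared second derivative
h <- grid[2] - grid[1]
total_roughness <- sum(d2^2) * h
tail_roughness  <- sum(d2[grid > 0.98]^2) * h  # part contributed by x > 0.98

cat("total:", signif(total_roughness, 3),
    " from x > 0.98:", signif(tail_roughness, 3), "\n")
```

If most of the total comes from the short interval beyond 0.98, a single global smoothing parameter has to trade that region against the rest of the curve.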

There are of course ways to get a better approximation after this change point, for example by manually setting a smaller smoothing parameter or a higher degree of freedom. But that takes us to the other extreme. Remember that both the penalty and the degrees of freedom are global measures. Increasing model complexity gives a better approximation after x = 0.98, but it also makes the rest of the curve more bumpy. Now let's try a model with 45 degrees of freedom:

worse <- smooth.spline(x, y, all.knots = TRUE, df = 45)
worse.pred <- predict(worse, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(worse.pred), col = 2, lwd = 2)

As you can see, the curve is bumpy. Of course it is: with 45 degrees of freedom for 50 observations (only 45 distinct x values), we have overfitted.
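
The trade-off can be put in numbers by comparing a low-df and a high-df fit on the same data. This is a sketch on simulated data (an assumption for illustration); rss() and roughness() are hypothetical helper functions defined here, not part of the stats package.

```r
set.seed(4)  # simulated data, for illustration only
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)

smooth_fit <- smooth.spline(x, y, all.knots = TRUE, df = 5)
wiggly_fit <- smooth.spline(x, y, all.knots = TRUE, df = 40)

# Hypothetical helper: residual sum of squares at the data points
rss <- function(fit) sum((y - predict(fit, x)$y)^2)

# Hypothetical helper: Riemann-sum approximation of the integrated
# squared second derivative of the fitted curve
roughness <- function(fit) {
  g <- seq(0, 1, length.out = 1000)
  d2 <- predict(fit, g, deriv = 2)$y
  sum(d2^2) * (g[2] - g[1])
}

comparison <- data.frame(df        = c(smooth_fit$df, wiggly_fit$df),
                         rss       = c(rss(smooth_fit), rss(wiggly_fit)),
                         roughness = c(roughness(smooth_fit), roughness(wiggly_fit)))
print(comparison)
```

The high-df fit follows the points more closely (lower residual sum of squares) at the cost of a much rougher curve.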

In fact, your original misuse of smooth.spline() does the same thing:

> mySpline
Call:
smooth.spline(x = results$beta, y = results$inventoryCost, penalty = 999999)

Smoothing Parameter  spar= -0.8074624  lambda= 3.266077e-19 (17 iterations)
Equivalent Degrees of Freedom (Df): 45
Penalized Criterion: 5.598386
GCV: 0.03824885

Oops: 45 degrees of freedom, which is overfitting!
