问题描述
我在ggplot2中使用stat_smooth函数,决定我想要拟合优度",并为此使用了mgvc gam.我想到我应该检查以确保它们是相同的模型(stat_smooth与mgvc gam),因此我使用下面的代码进行检查.似乎它们有不同的结果,如该图所证明(
您始终可以使用 ggplot_build
来查看绘图对象,并查看构成它的所有部分(我不在这里显示结果,因为它占用了很多空间,但是输出将打印到您的控制台).
ggplot_build(p1)
stat_smooth
的预测数据集是数据集列表中的第二个.
ggplot_build(p1)$ data [[2]]
但是看看该数据集有多少行:
nrow(ggplot_build(p1)$ data [[2]])[1] 80
n
参数的默认设置为80,但是数据集中有365行.那么,如果将 n
更改为365,会发生什么呢?我将使平滑线更胖,以便您实际上可以看到它(蓝色).
(p2 = ggplot(dat,aes(x,y))+geom_point()+geom_smooth(方法="gam",公式= y〜s(x,k = 100),n = 365,大小= 2)+geom_line(aes(y = predgam)))
nrow(ggplot_build(p2)$ data [[2]])[1] 365
如果查看 stat_smooth
帮助页面的详细信息"部分中提到的 predictdf
函数的代码,您会发现在以下情况下未使用原始数据集:做出预测.取而代之的是从原始的解释变量中得出一个序列.当使用小的数据集时,这是非常重要的,并且您需要更多的预测点才能使线看起来平滑.但是,在您的情况下,原始数据集已经是 x
的良好平滑序列,因此使用 n = 365
可以从 stat_smooth
中获得相同的预测原始数据集确实如此.
您可以看到 predictdf
的代码此处.
I was using the stat_smooth function in ggplot2, decided I wanted the "goodness of fit", and used mgvc gam for that. It occurred to me that I should check to make sure that they were the same model (stat_smooth vs mgvc gam), so I used the code below to check. Seemingly, they have different results, as evidenced by the plot (Plot: stat_smoother gam (red), mgcv gam (black)). However, I don't know why they have different results. Is it that some default parameter is different between the two? Is is that gam is being run on a numeric x and stat_smooth is being run with a POSIXct x (if so - I don't know what to do about that)? It looks like stat_smooth is smoother, but the k values are the same...
I think there are several posts on how to plot gam outputs in ggplot2, but I'd really like to know why stat_smooth and mgcv are giving different results in the first place. I am very new to GAM (and R), so it's quite possible I'm missing something easy. However, I did google and search this forum before asking.
My data is a bit big to easily share, so I used a sample dataset - I've put the source in the code, as well as a dput()
below everything, and my sessionInfo()
after that.
I have tried to make a quality question, but it is only my second one. Ever. So, constructive criticism is appreciated.
Thank you!
library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)
stackOF_data <- read_excel("mean-daily-flow-cumecs-vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)
stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]
a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a)
a2=data.table(gam_mdf= predict(a1,a))
a2=cbind(timeseries=stackOF_data$timeseries,a2)
# see if predict and actual are the same
p <- ggplot() +
geom_line(data = a2, aes(x = timeseries, y = gam_mdf), size=1)+
scale_color_manual(values=c("black","magenta"))+
scale_y_continuous()+
scale_x_datetime(labels = date_format("%Y-%m-%d"), breaks = "1 month", minor_breaks = "1 week")+
theme(axis.text.x=element_text(angle=50, size=10,hjust=1))+
stat_smooth(data = stackOF_data, aes(x = (timeseries), y = mdf),method="gam", formula=y~s(x,k=100, bs="cs"), col="red", se=FALSE, size=1)
p
# data from: https://datamarket.com/data/set/235m/mean-daily-flow-cumecs-vatnsdalsa-river-1-jan-1972-31-dec-1974#!ds=235m&display=line&s=14l
> dput(a)
structure(list(x = c(126230400, 126316800, 126403200, 126489600,
126576000, 126662400, 126748800, 126835200, 126921600, 127008000,
127094400, 127180800, 127267200, 127353600, 127440000, 127526400,
127612800, 127699200, 127785600, 127872000, 127958400, 128044800,
128131200, 128217600, 128304000, 128390400, 128476800, 128563200,
128649600, 128736000, 128822400, 128908800, 128995200, 129081600,
129168000, 129254400, 129340800, 129427200, 129513600, 129600000,
129686400, 129772800, 129859200, 129945600, 130032000, 130118400,
130204800, 130291200, 130377600, 130464000, 130550400, 130636800,
130723200, 130809600, 130896000, 130982400, 131068800, 131155200,
131241600, 131328000, 131414400, 131500800, 131587200, 131673600,
131760000, 131846400, 131932800, 132019200, 132105600, 132192000,
132278400, 132364800, 132451200, 132537600, 132624000, 132710400,
132796800, 132883200, 132969600, 133056000, 133142400, 133228800,
133315200, 133401600, 133488000, 133574400, 133660800, 133747200,
133833600, 133920000, 134006400, 134092800, 134179200, 134265600,
134352000, 134438400, 134524800, 134611200, 134697600, 134784000,
134870400, 134956800, 135043200, 135129600, 135216000, 135302400,
135388800, 135475200, 135561600, 135648000, 135734400, 135820800,
135907200, 135993600, 136080000, 136166400, 136252800, 136339200,
136425600, 136512000, 136598400, 136684800, 136771200, 136857600,
136944000, 137030400, 137116800, 137203200, 137289600, 137376000,
137462400, 137548800, 137635200, 137721600, 137808000, 137894400,
137980800, 138067200, 138153600, 138240000, 138326400, 138412800,
138499200, 138585600, 138672000, 138758400, 138844800, 138931200,
139017600, 139104000, 139190400, 139276800, 139363200, 139449600,
139536000, 139622400, 139708800, 139795200, 139881600, 139968000,
140054400, 140140800, 140227200, 140313600, 140400000, 140486400,
140572800, 140659200, 140745600, 140832000, 140918400, 141004800,
141091200, 141177600, 141264000, 141350400, 141436800, 141523200,
141609600, 141696000, 141782400, 141868800, 141955200, 142041600,
142128000, 142214400, 142300800, 142387200, 142473600, 142560000,
142646400, 142732800, 142819200, 142905600, 142992000, 143078400,
143164800, 143251200, 143337600, 143424000, 143510400, 143596800,
143683200, 143769600, 143856000, 143942400, 144028800, 144115200,
144201600, 144288000, 144374400, 144460800, 144547200, 144633600,
144720000, 144806400, 144892800, 144979200, 145065600, 145152000,
145238400, 145324800, 145411200, 145497600, 145584000, 145670400,
145756800, 145843200, 145929600, 146016000, 146102400, 146188800,
146275200, 146361600, 146448000, 146534400, 146620800, 146707200,
146793600, 146880000, 146966400, 147052800, 147139200, 147225600,
147312000, 147398400, 147484800, 147571200, 147657600, 147744000,
147830400, 147916800, 148003200, 148089600, 148176000, 148262400,
148348800, 148435200, 148521600, 148608000, 148694400, 148780800,
148867200, 148953600, 149040000, 149126400, 149212800, 149299200,
149385600, 149472000, 149558400, 149644800, 149731200, 149817600,
149904000, 149990400, 150076800, 150163200, 150249600, 150336000,
150422400, 150508800, 150595200, 150681600, 150768000, 150854400,
150940800, 151027200, 151113600, 151200000, 151286400, 151372800,
151459200, 151545600, 151632000, 151718400, 151804800, 151891200,
151977600, 152064000, 152150400, 152236800, 152323200, 152409600,
152496000, 152582400, 152668800, 152755200, 152841600, 152928000,
153014400, 153100800, 153187200, 153273600, 153360000, 153446400,
153532800, 153619200, 153705600, 153792000, 153878400, 153964800,
154051200, 154137600, 154224000, 154310400, 154396800, 154483200,
154569600, 154656000, 154742400, 154828800, 154915200, 155001600,
155088000, 155174400, 155260800, 155347200, 155433600, 155520000,
155606400, 155692800, 155779200, 155865600, 155952000, 156038400,
156124800, 156211200, 156297600, 156384000, 156470400, 156556800,
156643200, 156729600, 156816000, 156902400, 156988800, 157075200,
157161600, 157248000, 157334400, 157420800, 157507200, 157593600,
157680000), y = c(4.65, 4.65, 4.65, 4.48, 5.16, 5.52, 5.34, 5.34,
4.82, 4.65, 4.48, 4.31, 4.31, 4.31, 4.14, 3.82, 3.98, 3.98, 4.31,
5.71, 6.5, 6.3, 5.71, 5.71, 5.16, 4.65, 4.14, 3.98, 4.48, 4.48,
4.31, 4.65, 4.31, 3.98, 3.98, 3.98, 3.98, 3.98, 3.98, 3.82, 3.67,
3.67, 3.98, 3.98, 3.82, 3.82, 3.82, 4.14, 5.9, 4.48, 3.98, 3.98,
3.82, 3.67, 3.67, 3.67, 4.65, 3.98, 4.31, 4.31, 3.67, 4.31, 6.1,
7.3, 7.5, 7.5, 8.14, 10.8, 16.1, 14.8, 12.5, 9.9, 8.14, 6.9,
6.1, 5.34, 5.16, 4.99, 4.99, 4.99, 4.99, 5.52, 6.3, 7.3, 6.9,
5.9, 5.71, 5.71, 8.58, 31.5, 33.7, 18.4, 11.3, 16.1, 32.9, 45.3,
54, 25.7, 18, 15.9, 15.6, 14.5, 15.9, 35.9, 37.5, 29.4, 27.5,
30.1, 27.5, 30.8, 29.4, 22, 20.1, 35.9, 36.7, 32.9, 22, 18, 15.9,
15.2, 14.8, 13, 12.7, 12.5, 11, 9.68, 8.8, 7.92, 7.3, 6.9, 7.3,
10.3, 11, 11.3, 11.9, 12.5, 13.6, 12.2, 10.8, 9.9, 9.46, 8.8,
7.5, 7.1, 7.71, 7.1, 6.1, 5.34, 5.34, 5.34, 5.52, 5.52, 6.3,
6.7, 6.5, 5.9, 5.71, 5.9, 5.71, 5.52, 7.3, 7.5, 7.1, 7.3, 6.7,
6.9, 7.3, 7.5, 10.8, 11.6, 8.58, 7.92, 7.1, 6.7, 6.5, 6.1, 5.9,
5.9, 5.71, 5.52, 5.52, 5.52, 5.9, 5.9, 5.71, 5.52, 5.52, 5.34,
5.34, 5.52, 6.5, 6.5, 5.71, 5.34, 5.16, 4.99, 4.82, 4.82, 4.99,
4.82, 4.82, 4.82, 4.82, 4.82, 4.65, 4.48, 4.48, 4.31, 4.31, 4.14,
4.14, 4.31, 4.48, 4.31, 4.31, 4.31, 4.99, 5.71, 6.3, 6.1, 6.1,
5.9, 5.71, 5.52, 5.52, 5.52, 5.52, 5.52, 5.34, 5.34, 5.52, 5.52,
5.52, 5.34, 5.34, 5.52, 5.34, 5.52, 5.52, 5.34, 5.34, 5.34, 5.34,
5.71, 5.9, 6.3, 6.9, 7.5, 6.5, 6.1, 6.1, 5.9, 6.1, 6.1, 5.9,
6.5, 6.5, 6.1, 5.9, 5.9, 5.71, 5.9, 5.9, 5.71, 4.99, 4.65, 5.16,
5.34, 5.34, 4.65, 4.99, 5.71, 5.34, 5.34, 5.34, 5.34, 4.99, 5.34,
5.34, 5.34, 5.34, 5.52, 5.34, 5.52, 5.71, 6.5, 7.71, 6.9, 6.5,
6.7, 6.1, 5.9, 6.1, 5.9, 5.71, 7.92, 7.71, 7.1, 7.92, 5.34, 5.16,
8.14, 10.1, 7.92, 7.3, 6.9, 6.9, 6.9, 8.58, 7.3, 6.9, 7.3, 6.3,
5.16, 6.1, 5.52, 4.99, 5.34, 5.34, 5.34, 5.16, 5.71, 5.52, 5.52,
5.16, 4.82, 5.52, 6.1, 5.9, 5.71, 5.52, 5.16, 4.99, 4.48, 4.82,
5.16, 5.16, 5.16, 5.16, 5.16, 4.82, 4.65, 3.82, 4.14, 4.65, 4.65,
4.31, 4.31, 5.16, 5.16, 5.16, 5.16, 5.16, 4.99, 4.65, 5.16, 5.16,
5.16, 5.16, 5.16, 5.16, 5.16, 5.16, 5.34, 5.34)), .Names = c("x",
"y"), row.names = c(NA, -365L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x0000000005860788>)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 readxl_0.1.0 mgcv_1.8-7 nlme_3.1-121 scales_0.3.0 sos_1.3-8 brew_1.0-6 ggplot2_1.0.1
[9] MASS_7.3-43
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 lattice_0.20-33 digest_0.6.8 chron_2.3-47 grid_3.2.2 plyr_1.8.3 gtable_0.1.2 magrittr_1.5
[9] stringi_0.5-5 reshape2_1.4.1 Matrix_1.2-2 labeling_0.3 proto_0.3-10 tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[17] colorspace_1.2-6
Partial Solution
I still don't really know why the two methods are giving different answers, and that bothers me. However, after much internet searching, I did find the following workaround:
library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)
stackOF_data <- read_excel("C:/Users/jel4049/Desktop/mean-daily-flow-cumecs- vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)
stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]
a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a)
a2=data.table(gam_mdf = predict(a1,a))
preds <- predict(a1,se.fit=TRUE)
my_data <- data.frame(mu=preds$fit, low =(preds$fit - 1.96 * preds$se.fit), high = (preds$fit + 1.96 * preds$se.fit))
m <- ggplot()+
geom_line(data = a2, aes(x=stackOF_data$timeseries, y=gam_mdf), size=1, col="blue")+
geom_smooth(data=my_data,aes(ymin = low, ymax = high, x=stackOF_data$timeseries, y = mu), stat = "identity", col="green")
m
Now at least I know that the summary and data fit quality info I can get from some of the mgcv functions match my plots.
It turns out the differences you were seeing was because you were using the default for the n
argument in stat_smooth
.
From the help page:
Of course, it didn't jump out at me right away that this meant n
controls the size of the dataset for the newdata
argument in predict
and therefore stat_smooth
doesn't use the original dataset when making the predictions. But I was reading this nice answer on a different stat_smooth
question and realized that to figure out what was going on I should take a closer look at the stat_smooth
predictions vs manual predictions from a fitted gam
model.
So, using your dataset from your OP, which I named dat
, we can check what's going on.
The plot when k = 100
, after fitting the model via gam
and adding the predictions to the dataset. As you noted, the blue (stat_smooth
) and black (manual predictions) don't match.
dat$predgam = predict(gam(y ~ s(x, k = 100), data = dat))
(p1 = ggplot(dat, aes(x, y)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x, k = 100)) +
geom_line(aes(y = predgam)))
You can always use ggplot_build
to look at your plot object and see all the pieces that make it up (I'm not showing the results here because it takes up so much space, but the output will print to your Console).
ggplot_build(p1)
The prediction dataset for stat_smooth
is the second in the list of datasets.
ggplot_build(p1)$data[[2]]
But look how many rows that dataset has:
nrow(ggplot_build(p1)$data[[2]])
[1] 80
The default setting for the n
argument is 80, but you have 365 rows in your dataset. So what happens if you change n
to 365? I'll make the smooth line fatter so you can actually see it (in blue).
(p2 = ggplot(dat, aes(x, y)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x, k = 100), n = 365, size = 2) +
geom_line(aes(y = predgam)))
nrow(ggplot_build(p2)$data[[2]])
[1] 365
If you look at the code for the predictdf
function mentioned in the Details section of the stat_smooth
help page you'll see that the original dataset isn't used when making predictions. Instead, a sequence is made from the original explanatory variable. This is something that can be really important when working with a small dataset and you need more prediction points in order for the line to look smooth. In your case, though, the original dataset is already a nice smooth sequence of x
so using n = 365
gets the same predictions from stat_smooth
as the original dataset does.
You can see the code for predictdf
here.
这篇关于stat_smooth gam与gam {mgcv}不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!