Problem description
I am trying to predict a raster layer with the randomForest and caret packages, but the prediction fails when I introduce factor variables. Without factors everything works fine, but as soon as I bring a factor in, I get this error:
Error in predict.randomForest(modelFit, newdata) : Type of predictors in new data do not match that of the training data.
I have created some sample code below that walks through the process. I present it in a few steps for transparency and to provide a working example.
(To skip the setup code, jump from here...)
The first step creates sample data, fits an RF model, and predicts a raster with NO factors involved. Everything works fine.
# simulate data
x1p <- runif(50, 10, 20) # presence
x2p <- runif(50, 100, 200)
x1a <- runif(50, 15, 25) # absence
x2a <- runif(50, 180, 400)
x1 <- c(x1p, x1a)
x2 <- c(x2p,x2a)
y <- c(rep(1,50), rep(0,50)) # presence/absence
d <- data.frame(x1 = x1, x2 = x2, y = y)
# RF Classification on data with no factors... works fine
require(randomForest)
dRF <- d
dRF$y <- factor(ifelse(d$y == 1, "present", "absent"),
levels = c("present", "absent"))
rfFit <- randomForest(y = dRF$y, x = dRF[,1:2], ntree=100) # RF classification
# Create sample Rasters
require(raster)
r1 <- r2 <- raster(nrow=100, ncol=100)
values(r1) <- runif(ncell(r1), 5, 25 )
values(r2) <- runif(ncell(r2), 85, 500 )
s <- stack(r1, r2)
names(s) <- c("x1", "x2")
# raster::predict() with no factors, works fine.
model <- predict(s, rfFit, na.rm=TRUE, type="prob", progress='text')
spplot(model)
The next step creates a factor variable to add to the training data and a raster with matching values for the prediction. Note that the raster holds plain integers; it is not an as.factor() raster. Everything still works fine...
# Create factor variable
x3p <- sample(0:5, 50, replace=TRUE)
x3a <- sample(3:7, 50, replace=TRUE)
x3 <- c(x3p, x3a)
dFac <- dRF
dFac$x3 <- as.factor(x3)
dFac <- dFac[,c(1,2,4,3)] # reorder
# RF model with factors, works fine
rfFit2 <- randomForest(y ~ x1 + x2 + x3, data=dFac, ntree=100)
# Create new raster, but not as.factor()
r3 <- raster(nrow=100, ncol=100)
values(r3) <- sample(0:7, ncell(r3), replace=TRUE)
s2 <- stack(s, r3)
names(s2) <- c("x1", "x2", "x3")
s2 <- brick(s2) # brick or stack, either works
# RF, raster::predict() from fit with factor
f <- levels(dFac$x3) # included, but not necessary
model2 <- predict(s2, rfFit2, type="prob",
progress='text', factors=f, index=1:2)
spplot(model2) # works fine
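The integer-to-factor matching that this step relies on can be illustrated in base R, without raster or randomForest. This is only a sketch under my own assumptions: the names train_levels and cell_vals are made up for illustration, and I am assuming (not claiming) that something similar happens to the integer cell values before they reach the model.

```r
# Levels as they would come from the training factor, e.g. levels(dFac$x3)
train_levels <- levels(factor(0:7))

# Hypothetical integer cell values; 9 does not occur in the training data
cell_vals <- c(0L, 3L, 7L, 9L)

# Map the integers onto the training levels; unseen values become NA
as_fac <- factor(cell_vals, levels = train_levels)
print(as_fac)
print(levels(as_fac))
```

The point of passing levels explicitly is that the new factor carries exactly the same level set as the training factor, even when some levels are absent from the new data.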
After the above steps, I have an RF model that is trained on data including a factor variable and that predicts on a raster brick containing an integer raster of matching values. That is my end goal, but I want to be able to do it through the caret package workflow. Below I introduce caret::train() with no factors, and all works well.
# RF with Caret and NO factors
require(caret)
rf_ctrl <- trainControl(method = "cv", number=10,
allowParallel=FALSE, verboseIter=TRUE,
savePredictions=TRUE, classProbs=TRUE)
cFit1 <- train(y = dRF$y, x = dRF[,1:2], method = "rf",
tuneLength=4, trControl = rf_ctrl, importance = TRUE)
model3 <- predict(s2, cFit1, type="prob",
progress='text', factors=f, index=1:2)
spplot(model3) # works with caret and NO factors
(...to here. This is where the problem starts.)
Here is where things fail. Training an RF model with a factor variable through caret works, but the prediction fails at raster::predict().
# RF with Caret and FACTORS
rf_ctrl2 <- trainControl(method = "cv", number=10,
allowParallel=FALSE, verboseIter=TRUE,
savePredictions=TRUE, classProbs=TRUE)
cFit2 <- train(y = dFac$y, x = dFac[,1:3], method = "rf",
tuneLength=4, trControl = rf_ctrl2, importance = TRUE)
model4 <- predict(s2, cFit2, type="prob",
progress='text', factors=f, index=1:2)
# FAIL: "Type of predictors in new data do not match that of the training data."
Next I tried the same as above, but instead of an integer raster that has the same values as the factor levels, I made the raster into a factor using as.factor() and assigned levels. This fails as well.
#trying with raster as.factor()
r3f <- raster(nrow=100, ncol=100)
values(r3f) <- sample(0:7, ncell(r3f), replace=TRUE)
r3f <- as.factor(r3f)
f <- levels(r3f)[[1]]
f$code <- as.character(f[,1])
levels(r3f) <- f
s2f <- stack(s, r3f)
names(s2f) <- c("x1", "x2", "x3")
s2f <- brick(s2f)
model4f <- predict(s2f, cFit2, type="prob",
progress='text', factors=f, index=1:2)
# FAIL "Type of predictors in new data do not match that of the training data."
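Since the error is about predictor types, one way to narrow it down is to compare the column classes of the training predictors with those of a few rows of new data (for example cell values pulled from the stack with raster::getValues() and converted to a data frame). The helper below is hypothetical, written only for this illustration; it is not part of randomForest, caret, or raster, and the toy data frames stand in for the real objects.

```r
# Hypothetical helper: report the class of each training predictor
# next to the class of the same-named column in the new data.
compare_predictor_types <- function(train, newdata) {
  vars <- names(train)
  train_class <- vapply(train[vars], function(col) class(col)[1], character(1))
  new_class <- vapply(vars, function(v) {
    if (v %in% names(newdata)) class(newdata[[v]])[1] else NA_character_
  }, character(1))
  data.frame(variable = vars,
             train = unname(train_class),
             new = unname(new_class),
             match = unname(!is.na(new_class) & train_class == new_class),
             stringsAsFactors = FALSE)
}

# Toy stand-ins: x3 is a factor in training but an integer in the new data
train_x <- data.frame(x1 = c(12.5, 18.1), x2 = c(150, 320),
                      x3 = factor(c(0, 3), levels = 0:7))
cell_vals <- data.frame(x1 = c(10.2, 22.7), x2 = c(95, 410), x3 = c(5L, 7L))

types <- compare_predictor_types(train_x, cell_vals)
print(types)
```

A row with match = FALSE points at the predictor whose type differs between training and prediction, which is the situation the error message describes.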
The error and the progression of steps above suggest that the problem lies in how my approach combines caret::train() with raster::predict(). I have walked through the debugger (to the best of my ability) and addressed the issues I noticed, but found no smoking gun.
Any and all help would be greatly appreciated. Thanks!
Added: I kept experimenting and realized that it works if the model in caret::train() is written in formula form. Looking at the structure of the model object, it is easy to see that contrasts are created for the factor variable. I suppose this also means that raster::predict() recognizes the contrasts. This is good, but a bummer, because my methods are not set up to use formula-based predictions. Any additional help is still appreciated.
#with Caret WITH FACTORS as model formula!
rf_ctrl3 <- trainControl(method = "cv", number=10,
allowParallel=FALSE, verboseIter=TRUE, savePredictions=TRUE, classProbs=TRUE)
cFit3 <- train(y ~ x1 + x2 + x3, data=dFac, method = "rf",
tuneLength=4, trControl = rf_ctrl3, importance = TRUE)
model5 <- predict(s2, cFit3, type="prob", progress='text') # prediction raster
spplot(model5)
Recommended answer
It took a good bit of testing, but the answer is that raster::predict() only works with caret::train() models that contain factors if the model is specified as a formula (y ~ x1 + x2 + x3) and not as y = y, x = x (with x as a matrix or data.frame). Only through the formula interface will the model create the proper contrasts or dummy variables. There is no need to turn your raster layers into factors via as.factor(); the predict function will do that for you.
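To make the "contrasts or dummy variables" point concrete, here is a small base-R sketch of what a formula does to a factor column. It uses stats::model.matrix() on a made-up data frame (toy is not from the question); as far as I understand, the formula method of caret::train() builds its predictor matrix in a similar way, but treat that as an assumption rather than a description of caret's internals.

```r
# Toy data frame with one numeric predictor and one factor predictor
toy <- data.frame(x1 = c(11, 14, 19, 22),
                  x3 = factor(c(0, 1, 2, 1)))

# x/y style: the data frame is handed over as-is, so x3 stays a factor column
xy_classes <- vapply(toy, function(col) class(col)[1], character(1))
print(xy_classes)

# Formula style: model.matrix() expands x3 into numeric 0/1 indicator columns
mm <- model.matrix(~ x1 + x3, data = toy)
print(colnames(mm))
print(mm)
```

With the formula, the model only ever sees numeric columns, so a strict factor-versus-numeric type check on the new data has nothing to trip over. With x and y passed separately, the factor column is kept, and the new data must supply a factor with the same levels.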