Error when predicting a raster with randomForest, caret, and factor variables

Problem description

I am trying to predict a raster layer with randomForest and the caret package, but fail when I introduce factor variables. Without factors, everything works fine, but as soon as I bring a factor in, I get the error:

Error in predict.randomForest(modelFit, newdata) : Type of predictors in new data do not match that of the training data.

I have created some sample code below that walks through the process. I present it in a few steps for transparency and to provide a working example.

(To skip the set-up code, jump from here...)

First is creating sample data, fitting an RF model, and predicting a raster with NO factors involved. Everything works fine.

# simulate data
x1p <- runif(50, 10, 20) # presence
x2p <- runif(50, 100, 200)
x1a <- runif(50, 15, 25) # absence
x2a <- runif(50, 180, 400)
x1 <- c(x1p, x1a)
x2 <- c(x2p,x2a)
y <- c(rep(1,50), rep(0,50)) # presence/absence
d <- data.frame(x1 = x1, x2 = x2, y = y)

# RF Classification on data with no factors... works fine
require(randomForest)
dRF <- d
dRF$y <- factor(ifelse(d$y == 1, "present", "absent"),
                levels = c("present", "absent"))
rfFit <- randomForest(y = dRF$y, x = dRF[,1:2], ntree=100) # RF classification

# Create sample Rasters
require(raster)
r1 <- r2 <- raster(nrow=100, ncol=100)
values(r1) <- runif(ncell(r1), 5, 25 )
values(r2) <- runif(ncell(r2), 85, 500 )
s <- stack(r1, r2)
names(s) <- c("x1", "x2")

# raster::predict() with no factors, works fine.
model <- predict(s, rfFit, na.rm=TRUE, type="prob", progress='text')
spplot(model)

The next steps are creating a factor variable to add to the training data and creating a raster with matching values for the prediction. Note that the raster is a regular old integer raster, not an as.factor() raster. Everything still works fine...

# Create factor variable
x3p <- sample(0:5, 50, replace=T)
x3a <- sample(3:7, 50, replace=T)
x3 <- c(x3p, x3a)
dFac <- dRF
dFac$x3 <- as.factor(x3)
dFac <- dFac[,c(1,2,4,3)] # reorder

# RF model with factors, works fine
rfFit2 <- randomForest(y ~ x1 + x2 + x3, data=dFac, ntree=100)

# Create new raster, but not as.factor()
r3 <- raster(nrow=100, ncol=100)
values(r3) <- sample(0:7, ncell(r3), replace=T)
s2 <- stack(s, r3)
names(s2) <- c("x1", "x2", "x3")
s2 <- brick(s2) # brick or stack, either work

# RF, raster::predict() from fit with factor
f <- levels(dFac$x3) # included, but not necessary
model2 <- predict(s2, rfFit2,  type="prob",
          progress='text', factors=f, index=1:2)
spplot(model2) # works fine

After the above steps, I now have an RF model that is trained with data including a factor variable and predicted on a raster brick that contains an integer raster of like values. That is my end goal, but I want to be able to do it through the caret package workflow. Below I introduce caret::train() with no factors and all works well.

# RF with Caret and NO factors
require(caret)
rf_ctrl <- trainControl(method = "cv", number=10,
           allowParallel=FALSE, verboseIter=TRUE,
           savePredictions=TRUE, classProbs=TRUE)
cFit1 <- train(y = dRF$y, x = dRF[,1:2], method = "rf",
         tuneLength=4, trControl = rf_ctrl, importance = TRUE)
model3 <- predict(s2, cFit1,  type="prob",
          progress='text', factors=f, index=1:2)
spplot(model3) # works with caret and NO factors

(...to here. This is where the problem starts.)

Here is where things fail. Training an RF model with caret on data that includes a factor variable works, but the prediction fails at raster::predict().

# RF with Caret and FACTORS
rf_ctrl2 <- trainControl(method = "cv", number=10,
            allowParallel=FALSE, verboseIter=TRUE,
            savePredictions=TRUE, classProbs=TRUE)
cFit2 <- train(y = dFac$y, x = dFac[,1:3], method = "rf",
         tuneLength=4, trControl = rf_ctrl2, importance = TRUE)
model4 <- predict(s2, cFit2,  type="prob",
          progress='text', factors=f, index=1:2)
# FAIL: "Type of predictors in new data do not match that of the training data."
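
A minimal sketch of how this message can arise, without rasters or caret. It assumes only the randomForest package; the names train_df, new_int, and new_fac are made up for illustration. When a model is fitted through the x/y interface with a factor predictor, predict() appears to require the same column to arrive as a factor with the same levels. Cell values extracted from a raster are plain numbers, which would trigger the same check.

```r
library(randomForest)
set.seed(1)

# Training data: one numeric predictor and one factor predictor (levels "0".."7").
train_df <- data.frame(
  x1 = runif(100, 10, 25),
  x3 = factor(sample(0:7, 100, replace = TRUE), levels = 0:7)
)
y <- factor(sample(c("present", "absent"), 100, replace = TRUE))
fit <- randomForest(x = train_df, y = y, ntree = 50)

# New data where x3 is a plain integer column, as raster cell values would be.
new_int <- data.frame(x1 = runif(10, 10, 25),
                      x3 = sample(0:7, 10, replace = TRUE))
msg <- tryCatch(predict(fit, new_int),
                error = function(e) conditionMessage(e))
print(msg)  # the error message is captured instead of stopping the script

# Converting x3 to a factor with the training levels lets predict() run.
new_fac <- new_int
new_fac$x3 <- factor(new_int$x3, levels = levels(train_df$x3))
pred <- predict(fit, new_fac)
print(length(pred))
```

If this is what happens inside raster::predict(), the x3 layer has to be converted to a factor with the training levels before the values reach the model.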

Trying the same as above, but instead of an integer raster that has the same values as the factor levels, I make the raster into a factor using as.factor() and assigning levels. This fails as well.

#trying with raster as.factor()
r3f <- raster(nrow=100, ncol=100)
values(r3f) <- sample(0:7, ncell(r3f), replace=T)
r3f <- as.factor(r3f)
f <- levels(r3f)[[1]]
f$code <- as.character(f[,1])
levels(r3f) <- f
s2f <- stack(s, r3f)
names(s2f) <- c("x1", "x2", "x3")
s2f <- brick(s2f)

model4f <- predict(s2f, cFit2,  type="prob",
           progress='text', factors=f, index=1:2)
# FAIL "Type of predictors in new data do not match that of the training data."
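
One thing that may be worth checking: the raster::predict() documentation describes the factors argument as a list of levels whose elements are named after the corresponding predictors. In the calls above, f is first an unnamed character vector and then a data.frame with other column names. The sketch below only builds such a named list; x3_train is a stand-in for dFac$x3, and whether this resolves the caret case is an untested assumption.

```r
# Stand-in for the training factor dFac$x3 (levels "0" to "7").
x3_train <- factor(c(0:5, 3:7), levels = 0:7)

# Named list: the element name should equal the layer / predictor name "x3".
f_list <- list(x3 = levels(x3_train))
str(f_list)

# Hypothetical call, not run here (needs s2 and cFit2 from the steps above):
# model4 <- predict(s2, cFit2, type = "prob", progress = "text",
#                   factors = f_list, index = 1:2)
```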

The error and the progression of steps above clearly suggest that there is an issue with my approach and how caret::train() interacts with raster::predict(). I have walked through the debugger (to the best of my ability) and addressed the issues I noticed, but there was no smoking gun.

Any and all help would be greatly appreciated. Thanks!

Added: I kept experimenting and realized that it works if the model in caret::train() is written in formula form. Looking at the structure of the model object, it is easy to see that contrasts are created for the factor variable. I suppose this also means that raster::predict() recognizes the contrasts. This is good, but a bummer because my methods are not set up to use formula-based predictions. Any additional help is still appreciated.

#with Caret WITH FACTORS as model formula!
rf_ctrl3 <- trainControl(method = "cv", number=10,
            allowParallel=FALSE, verboseIter=TRUE, savePredictions=TRUE, classProbs=TRUE)
cFit3 <- train(y ~ x1 + x2 + x3, data=dFac, method = "rf",
            tuneLength=4, trControl = rf_ctrl3, importance = TRUE)

model5 <- predict(s2, cFit3,  type="prob", progress='text') # prediction raster
spplot(model5)
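
To illustrate what "contrasts are created" means, here is a small base-R sketch of how a formula expands a factor into dummy columns. The data frame d is made up, and the assumption that caret's formula interface relies on the same kind of expansion is mine, not something confirmed above.

```r
# A factor with levels "0", "1", "2" next to a numeric predictor.
d <- data.frame(x1 = c(1.5, 2.5, 3.5, 4.5),
                x3 = factor(c(0, 1, 2, 1)))

# The formula interface turns x3 into treatment-contrast dummy columns.
mm <- model.matrix(~ x1 + x3, data = d)
colnames(mm)
# "(Intercept)" "x1" "x31" "x32"
```

With this encoding the underlying forest only ever sees numeric columns, which may be why the factor type check is not triggered in the formula case.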
