Problem description
I've recently started playing around with the randomForest package in R. After growing my forest, I tried predicting the response using the same dataset (i.e. the training dataset), which gave me a confusion matrix different from the one printed with the forest object itself. I thought there might be something wrong with the newdata argument, but I followed the example given in the documentation to a T and got the same problem. Here's an example using the iris dataset, predicting Species. It is the same example the authors use in their documentation, except that I use the same dataset to train and predict. So the question is: why are those two confusion matrices not identical?
library(randomForest)
data(iris)
set.seed(111)
# split roughly 80/20 into training (ind == 1) and test (ind == 2) rows
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
# grow forest on the training rows only
iris.rf <- randomForest(Species ~ ., data = iris[ind == 1, ])
print(iris.rf)
Call:
 randomForest(formula = Species ~ ., data = iris[ind == 1, ])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         45          0         0  0.00000000
versicolor      0         39         1  0.02500000
virginica       0          3        32  0.08571429
# predict using the training data again...
iris.pred <- predict(iris.rf, iris[ind == 1,])
table(observed = iris[ind==1, "Species"], predicted = iris.pred)
            predicted
observed     setosa versicolor virginica
  setosa         45          0         0
  versicolor      0         40         0
  virginica       0          0        35
Recommended answer
You'll note that in the first summary, the confusion matrix is labelled the OOB estimate.
This stands for out-of-bag, and it is not the same as directly predicting each observation in the training set with the forest. The latter is obviously a biased estimate of accuracy; the OOB estimate is less so (OOB has its critics as well, but it is at least more reasonable).
Basically, when you print the summary itself, each observation is tested only on the trees in which it was not used, i.e. the trees for which it was "out of bag". So the OOB prediction for a given observation uses only a subset of the trees in your forest (roughly one third of them on average, since each tree's bootstrap sample contains about two thirds of the observations).
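As a rough check on that "subset of trees" point, you can refit with keep.inbag = TRUE and count, for each training row, the fraction of trees for which it was out of bag. This is just a sketch (the exact numbers depend on the seed), and iris.rf2 / oob.frac are names I've made up here:

library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))

# keep.inbag = TRUE stores an n x ntree matrix recording which training
# rows were in the bootstrap sample of each tree
iris.rf2 <- randomForest(Species ~ ., data = iris[ind == 1, ],
                         keep.inbag = TRUE)

# fraction of trees for which each training row is out of bag
oob.frac <- rowMeans(iris.rf2$inbag == 0)
summary(oob.frac)   # typically centred near exp(-1), i.e. about 0.37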
When you call predict on the training data directly, every tree votes, including the trees in whose construction each observation was actually used, so it is not surprising that this version gets every observation right while the OOB version has some misclassifications.
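Per the randomForest documentation, calling predict with newdata omitted returns the out-of-bag predictions, so you can reproduce the confusion matrix from the printed summary and compare it with the resubstitution one (a quick sketch using the iris.rf object fitted above):

# OOB predictions: predict() with newdata omitted returns object$predicted
iris.oob <- predict(iris.rf)

# this reproduces the confusion matrix printed with the forest object ...
table(observed = iris[ind == 1, "Species"], predicted = iris.oob)

# ... which is also stored directly on the fitted object
iris.rf$confusion

# whereas passing the training data back in lets every tree vote,
# including the trees that saw each row during training
table(observed = iris[ind == 1, "Species"],
      predicted = predict(iris.rf, iris[ind == 1, ]))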