Question
I'm taking the Coursera Practical Machine Learning course, and the coursework requires building predictive models using this dataset. I split the data into training and testing datasets based on the outcome of interest (labelled y here, but in fact the classe variable in the dataset):
library(caret)

inTrain <- createDataPartition(y = data$y, p = 0.75, list = FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
I then tried two different methods:
modFit <- caret::train(y ~ ., method = "rpart", data = training)
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)
versus
modFit <- rpart::rpart(y ~ ., data = training)
pred <- predict(modFit, newdata = testing, type = "class")
confusionMatrix(pred, testing$y)
I assumed they would give identical or very similar results, because the first method loads the 'rpart' package (which suggests to me that it uses this package to fit the model). However, both the timings (caret is much slower) and the results are very different:
Method 1 (caret):
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1264  374  403  357  118
         B   25  324   28  146  124
         C  105  251  424  301  241
         D    0    0    0    0    0
         E    1    0    0    0  418
Method 2 (rpart):
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1288  176   14   79   25
         B   36  569   79   32   68
         C   31   88  690  121  113
         D   14   66   52  523   44
         E   26   50   20   49  651
As you can see, the second approach is the better classifier. The first method never predicts class D at all and performs poorly on class E.
I realise this may not be the most appropriate place to ask this question, but I would really appreciate a better understanding of this and related issues. caret seems like a great package for unifying the methods and call syntax, but I'm now hesitant to use it.
Recommended answer
caret actually does quite a bit more under the hood. In particular, it uses resampling (cross-validation in the broad sense) to optimize the model hyperparameters. In your case, it tries three values of cp (type modFit and you'll see accuracy results for each value), whereas rpart just uses cp = 0.01 unless you tell it otherwise (see ?rpart.control). The resampling also explains the longer run time, especially since caret uses bootstrapping by default and so refits the model many times for each candidate value.
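To see what the tuning step actually did, the fitted train object can be inspected. The following is a minimal sketch, not the original poster's code: it uses the built-in iris data as a stand-in for the course dataset (an assumption, with Species playing the role of y), and the seed value is arbitrary.

```r
library(caret)

# Resampling is random, so fix a seed for repeatability (value is arbitrary)
set.seed(42)

# iris is only a stand-in for the course data; Species plays the role of y
fit <- train(Species ~ ., data = iris, method = "rpart")

print(fit)           # resampled accuracy and kappa for each candidate cp
fit$results          # the same summary as a data frame
fit$bestTune         # the cp value caret selected for the final model
fit$control$method   # the resampling scheme that was used
```

If the selected cp is well away from 0.01, the final tree caret keeps can be much smaller than the one rpart grows by default, which would be consistent with a model that never predicts some classes.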
To get similar results, you need to disable the cross-validation (resampling) step and fix cp yourself:
modFit <- caret::train(y ~ ., method = "rpart", data = training,
                       trControl = trainControl(method = "none"),
                       tuneGrid = data.frame(cp = 0.01))
In addition, you should set the same random seed before fitting both models.
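A side-by-side check along these lines might look like the sketch below. It again assumes the built-in iris data as a stand-in for the course dataset, and the object names (idx, fit_caret, fit_rpart, agreement) are illustrative rather than taken from the question.

```r
library(caret)
library(rpart)

# iris stands in for the course data; Species plays the role of y
set.seed(123)
idx <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[idx, ]
testing  <- iris[-idx, ]

# caret with resampling switched off and cp fixed at rpart's default
set.seed(123)
fit_caret <- train(Species ~ ., data = training, method = "rpart",
                   trControl = trainControl(method = "none"),
                   tuneGrid = data.frame(cp = 0.01))

# rpart called directly, relying on its default cp = 0.01
set.seed(123)
fit_rpart <- rpart(Species ~ ., data = training)

pred_caret <- predict(fit_caret, newdata = testing)
pred_rpart <- predict(fit_rpart, newdata = testing, type = "class")

# Proportion of test rows on which the two models give the same class
agreement <- mean(as.character(pred_caret) == as.character(pred_rpart))
agreement
```

With numeric predictors the two fits should agree closely. On data with factor predictors they may still differ somewhat, because the formula interface of train expands factors into dummy variables before calling rpart.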
That said, the extra functionality that caret provides is a Good Thing, and you should probably just go with caret. If you want to learn more, it's well documented, and the author has a stellar book, Applied Predictive Modeling.