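I am trying to build a RandomForest model in caret by following the steps here. Essentially, they set up the RandomForest, then tune for the best mtry, then the best maxnodes, then the best number of trees. These steps make sense, but wouldn't it be better to search over all three parameters jointly rather than one at a time?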

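Second, I understand how to run a grid search over mtry and ntree, but I don't know how to set the minimum node size or the maximum number of nodes. Is it generally recommended to keep the default node size, as in the code below?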

library(randomForest)
library(caret)
mtrys<-seq(1,4,1)
ntrees<-c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)
combo_mtrTrees<-data.frame(expand.grid(mtrys, ntrees))
colnames(combo_mtrTrees)<-c('mtrys','ntrees')

tuneGrid <- expand.grid(.mtry = c(1: 4))
for (i in 1:length(ntrees)){
  ntree<-ntrees[i]
  set.seed(65)
  rf_maxtrees <- train(Species~.,
                       data = df,
                       method = "rf",
                       importance=TRUE,
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trainControl( method = "cv",
                                                 number=5,
                                                 search = 'grid',
                                                 classProbs = TRUE,
                                                 savePredictions = "final"),
                       ntree = ntree
                       )
  Acc1<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==1]
  Acc2<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==2]
  Acc3<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==3]
  Acc4<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==4]
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==1 & combo_mtrTrees$ntrees==ntree]<-Acc1
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==2 & combo_mtrTrees$ntrees==ntree]<-Acc2
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==3 & combo_mtrTrees$ntrees==ntree]<-Acc3
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==4 & combo_mtrTrees$ntrees==ntree]<-Acc4
}

Best answer

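  • Yes, it is better to search over the parameters jointly, since they can interact.
  • nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally, I would leave maxnodes at its default and perhaps tune nodesize; it can be viewed as a regularization parameter. To get an idea of which values to try, check the defaults in rf, which are 1 for classification and 5 for regression. So trying 1-10 is an option.
  • When tuning in a loop as in your example, it is advisable to always use the same cross-validation folds. You can create them with createFolds before running the loop.
  • After tuning, be sure to evaluate your result on an independent validation set, or perform nested cross validation, where an inner loop is used to tune the parameters and an outer loop is used to estimate model performance. Results from cross-validation alone will be optimistically biased.
  • In most cases accuracy is not an appropriate metric for choosing the best classification model, especially when the data set is imbalanced. Read up on receiver operating characteristic AUC, Cohen's kappa, Matthews correlation coefficient, balanced accuracy, F1 score, and classification threshold tuning.
  • Here is an example of how to tune the rf parameters jointly. I will use the Sonar data set from the mlbench package.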

  • Create predefined folds:
    library(caret)
    library(mlbench)
    data(Sonar)
    
    set.seed(1234)
    cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)
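
With returnTrain = TRUE, createFolds returns the training row indices for each fold, which is the form the index argument of trainControl expects. The following minimal sketch, which is not part of the original answer, checks that the held-out rows of the five folds partition the data; the name held_out is illustrative.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

# each element holds the row numbers used for training in that fold
sapply(cv_folds, length)

# the held-out rows are the complement of each training index
held_out <- lapply(cv_folds, function(idx) setdiff(seq_len(nrow(Sonar)), idx))

# counts how often each row is held out; every row should appear exactly once
table(table(unlist(held_out)))
```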
    

    Create the tune control:
    tuneGrid <- expand.grid(.mtry = c(1 : 10))
    
    ctrl <- trainControl(method = "cv",
                         number = 5,
                         search = 'grid',
                         classProbs = TRUE,
                         savePredictions = "final",
                         index = cv_folds,
                         summaryFunction = twoClassSummary) #in most cases a better summary for two class problems
    

    Define the other parameters to tune. I will use only a few combinations to limit the training time of the example:
    ntrees <- c(500, 1000)
    nodesize <- c(1, 5)
    
    params <- expand.grid(ntrees = ntrees,
                          nodesize = nodesize)
    

    Train:
    store_maxnode <- vector("list", nrow(params))
    for(i in 1:nrow(params)){
      nodesize <- params[i,2]
      ntree <- params[i,1]
      set.seed(65)
      rf_model <- train(Class~.,
                           data = Sonar,
                           method = "rf",
                           importance=TRUE,
                           metric = "ROC",
                           tuneGrid = tuneGrid,
                           trControl = ctrl,
                           ntree = ntree,
                           nodesize = nodesize)
      store_maxnode[[i]] <- rf_model
      }
    

    Combine the results:
    results_mtry <- resamples(store_maxnode)
    
    summary(results_mtry)
    

    Output:
    Call:
    summary.resamples(object = results_mtry)
    
    Models: Model1, Model2, Model3, Model4
    Number of resamples: 5
    
    ROC
                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    Model1 0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273    0
    Model2 0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182    0
    Model3 0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545    0
    Model4 0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818    0
    
    Sens
                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    Model1 0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000    0
    Model2 0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455    0
    Model3 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0
    Model4 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0
    
    Spec
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    Model1 0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000    0
    Model2 0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000    0
    Model3 0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053    0
    Model4 0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000    0
    

    To get the best mtry for each model:
    lapply(store_maxnode, print)
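
Printing each model works, but a single table can be easier to compare. The following is a minimal sketch, not part of the original answer, that refits a reduced version of the loop (a smaller mtry grid, assumed here only to keep it fast) and collects the best mtry and its cross-validated ROC for every ntree/nodesize combination. The names fits and best are illustrative.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     classProbs = TRUE,
                     index = cv_folds,
                     summaryFunction = twoClassSummary)

# reduced grids, chosen only to keep the sketch fast (an assumption)
tuneGrid <- expand.grid(.mtry = c(2, 5, 10))
params <- expand.grid(ntrees = c(500, 1000), nodesize = c(1, 5))

# one caret model per ntree/nodesize combination, all on the same folds
fits <- lapply(seq_len(nrow(params)), function(i) {
  set.seed(65)
  train(Class ~ ., data = Sonar, method = "rf", metric = "ROC",
        tuneGrid = tuneGrid, trControl = ctrl,
        ntree = params$ntrees[i], nodesize = params$nodesize[i])
})

# one row per combination: the mtry with the highest mean CV ROC
best <- do.call(rbind, lapply(seq_along(fits), function(i) {
  res <- fits[[i]]$results
  top <- res[which.max(res$ROC), ]
  data.frame(ntrees = params$ntrees[i], nodesize = params$nodesize[i],
             mtry = top$mtry, ROC = top$ROC)
}))

# combinations sorted from highest to lowest ROC
best[order(-best$ROC), ]
```

The differences between combinations are often small relative to the fold-to-fold spread, so the top row is a candidate rather than a clear winner.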
    

    Alternatively, you can use the default summary:
    ctrl <- trainControl(method = "cv",
                             number = 5,
                             search = 'grid',
                             classProbs = TRUE,
                             savePredictions = "final",
                             index = cv_folds)
    

    and set metric = "Kappa" in the call to train.
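
The advice about nested cross validation can be sketched roughly as follows. This is a minimal illustration under assumptions that are not in the original answer: 5 outer folds, 3 inner folds, a small mtry grid, fixed ntree and nodesize, and AUC computed with the pROC package (which twoClassSummary also relies on). The names outer_folds, inner_ctrl and outer_auc are illustrative.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(1234)
outer_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

inner_ctrl <- trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

outer_auc <- sapply(outer_folds, function(train_idx) {
  train_set <- Sonar[train_idx, ]
  test_set  <- Sonar[-train_idx, ]

  # inner loop: tune mtry using only the outer-training rows
  set.seed(65)
  fit <- train(Class ~ ., data = train_set, method = "rf", metric = "ROC",
               tuneGrid = expand.grid(.mtry = c(2, 5, 10)),
               trControl = inner_ctrl, ntree = 500, nodesize = 1)

  # outer loop: score the tuned model on rows it never saw
  p_M <- predict(fit, newdata = test_set, type = "prob")[, "M"]
  as.numeric(pROC::auc(pROC::roc(test_set$Class, p_M,
                                 levels = c("R", "M"), direction = "<")))
})

outer_auc
mean(outer_auc)
```

The mean of outer_auc estimates the performance of the whole tuning procedure, not of one fixed parameter setting, which is why it avoids the optimistic bias of reporting the inner cross-validation score.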

    08-24 17:02