r - Tuning mtry with caret returns strange values

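I am tuning the mtry parameter of randomForest with the train function from the caret package. My X data has only 48 columns, yet train returns mtry=50 as the best value, which is not a valid value (>48). What explains this?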

> dim(X)
[1] 93 48
> fit <- train(level~., data=data.frame(X,level), tuneLength=13)
> fit$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 50

        OOB estimate of  error rate: 2.15%
Confusion matrix:
     high low class.error
high   81   1  0.01219512
low     1  10  0.09090909

If I do not set the tuneLength parameter, it gets even worse:

> fit <- train(level~., data=data.frame(X,level))
> fit$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 55

        OOB estimate of  error rate: 2.15%
Confusion matrix:
     high low class.error
high   81   1  0.01219512
low     1  10  0.09090909

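I am not providing the data because it is confidential. There is nothing special about it, though: every column is either numeric or a factor, and there are no missing values.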

Best answer

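There is most likely a discrepancy between the number of columns in your data set and the number of predictors[1]; the two can differ if any of the columns are factors. You used the formula method, which expands factors into dummy variables. For example: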

> head(model.matrix(Sepal.Width ~ ., data = iris))
  (Intercept) Sepal.Length Petal.Length Petal.Width Speciesversicolor Speciesvirginica
1           1          5.1          1.4         0.2                 0                0
2           1          4.9          1.4         0.2                 0                0
3           1          4.7          1.3         0.2                 0                0
4           1          4.6          1.5         0.2                 0                0
5           1          5.0          1.4         0.2                 0                0
6           1          5.4          1.7         0.4                 0                0

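So iris has 4 predictor columns here (three numeric columns plus the factor Species), but you end up with 5 (non-intercept) predictors, because the three-level factor Species becomes two dummy variables.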
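To check this count on your own data, a minimal sketch along the following lines may help. It uses only base R and the built-in iris data; the variable names (predictors, n_columns, n_expanded, is_fac, extra) are made up for illustration and are not part of caret.

```r
# Predictor columns as they appear in the data frame (response removed)
predictors <- iris[, setdiff(names(iris), "Sepal.Width")]
n_columns <- ncol(predictors)

# Predictors after the formula interface expands factors into dummy variables
mm <- model.matrix(Sepal.Width ~ ., data = iris)
n_expanded <- ncol(mm) - 1  # drop the "(Intercept)" column

# Each factor contributes (number of levels - 1) dummy columns
is_fac <- vapply(predictors, is.factor, logical(1))
extra <- vapply(predictors[is_fac], function(f) nlevels(f) - 1L, integer(1))

cat("data frame columns: ", n_columns, "\n")
cat("expanded predictors:", n_expanded, "\n")
cat("numeric columns + dummy columns:", sum(!is_fac) + sum(extra), "\n")
```

Running the same count on data.frame(X, level) should show how many predictors the formula interface actually hands to randomForest, and therefore the real upper bound for mtry. If you would rather have mtry bounded by the number of columns, train also has a non-formula interface, train(x = X, y = level), which passes the predictors through without creating dummy variables.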

Max

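[1] This is why you need to provide a reproducible example. Often, when I am getting ready to ask a question, the answer becomes apparent while I take the time to write a good description of the problem.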