Question
I want to run a random forest (RF) classification as it is specified in 'randomForest', but still use repeated k-fold cross-validation (code below). How do I stop caret from creating dummy variables out of my categorical ones? I read that this may be due to one-hot encoding, but I'm not sure how to change it. I would be very grateful for some example lines on how to fix this!
Data:
> str(river)
'data.frame': 121 obs. of 13 variables:
$ stat_bino : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
$ Subfamily : Factor w/ 14 levels "carettochelyinae",..: 14 14 14 14 8 8 8 8 8 8 ...
$ MAXCL : num 850 850 360 540 625 600 760 480 560 580 ...
$ CS : num 8 8 14 15 26 25.5 20 20 18 21.5 ...
$ CF : num 3.5 3.5 2.5 2 1.5 3 2 2 1 1 ...
$ size_mat : num 300 300 170 180 450 450 460 406 433 433 ...
$ incubat : num 97.5 97.5 71 72.5 91.5 67.5 73 55 83 80 ...
$ diet : Factor w/ 5 levels "omnivore leaning carnivore",..: 1 1 1 1 2 2 2 5 4 4 ...
$ HDI : num 721 627 878 885 704 ...
$ HF09M93 : num 23.19 9.96 -8.52 -5.67 27.3 ...
$ HF09 : num 116 121 110 110 152 ...
$ deg_reg : num 8.64 39.37 370.95 314.8 32.99 ...
$ protected_area: num 7.55 10.93 2.84 2.89 12.71 …
Code and output:
> control <- trainControl(method='repeatedcv',
+ number=5,repeats = 3,
+ search='grid')
> tunegrid <- expand.grid(.mtry = (1:12))
> rf_gridsearch <- train(stat_bino ~ .,
+ data = river,
+ method = 'rf',
+ metric = 'Accuracy',
+ ntree = 600,
+ importance = TRUE,
+ tuneGrid = tunegrid, trControl = control)
> rf_gridsearch$finalModel[["xNames"]]
[1] "Subfamilychelinae" "Subfamilychelodininae" "Subfamilychelydrinae"
[4] "Subfamilycyclanorbinae" "Subfamilydeirochelyinae" "Subfamilydermatemydinae"
[7] "Subfamilygeoemydinae" "Subfamilykinosterninae" "Subfamilypelomedusinae"
...you get the picture. I now have 27 predictors instead of 12.
Recommended answer
When you use the formula interface to train:
train(stat_bino ~ .,
...
it will convert factors using dummy coding. This makes sense because formulas in most traditional R functions work this way (for instance lm).
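As a rough illustration of what dummy coding does to a factor, here is a base R sketch using model.matrix on a made-up data frame. The names toy, grp and num are hypothetical, and this only mimics the kind of expansion a formula triggers; it is not caret's own code.

```r
# Hypothetical toy data: a two-level outcome, one three-level factor, one numeric column
toy <- data.frame(
  y   = factor(rep(c("0", "1"), times = 6)),
  grp = factor(rep(c("a", "b", "c"), times = 4)),
  num = seq_len(12) / 2
)

# Formula-style expansion: the factor becomes one indicator column per
# non-reference level ("a" is the reference level here)
mm <- model.matrix(y ~ ., data = toy)
print(colnames(mm))

# Plain predictor data frame: the factor stays a single column
x <- toy[, colnames(toy) != "y"]
print(colnames(x))
print(is.factor(x$grp))
```

The same mechanism explains the count in the question: a 14-level and a 5-level factor expand into 13 + 4 indicator columns, which together with the 10 numeric predictors gives 27.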
However, if you use the non-formula interface:
train(y = river$stat_bino,
x = river[,colnames(river) != "stat_bino"],
...
then caret will leave the variables as they are supplied. This is what you want with tree-based methods, but it will produce errors with algorithms that cannot handle factors internally, such as glmnet.
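For completeness, a minimal sketch of the full x/y call with the same resampling setup as in the question. It assumes the caret and randomForest packages are installed, and it runs on a made-up data frame (toy, grp, num1 and num2 are hypothetical names) so that it is self-contained; with the real data you would pass river in place of toy and stat_bino as the outcome.

```r
library(caret)  # method = "rf" also requires the randomForest package

set.seed(42)
n <- 90
# Hypothetical stand-in for the real data: one factor and two numeric predictors
toy <- data.frame(
  grp  = factor(rep(c("a", "b", "c"), length.out = n)),
  num1 = rnorm(n),
  num2 = runif(n)
)
toy$y <- factor(ifelse(toy$num1 + (toy$grp == "a") > 0.3, "1", "0"))

# Same repeated 5-fold cross-validation as in the question
control  <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                         search = "grid")
tunegrid <- expand.grid(mtry = 1:3)  # at most the number of predictor columns

# x/y interface: the predictors are passed as a data frame, so factors stay factors
rf_xy <- train(x = toy[, colnames(toy) != "y"],
               y = toy$y,
               method = "rf",
               metric = "Accuracy",
               ntree = 100,
               importance = TRUE,
               tuneGrid = tunegrid,
               trControl = control)

# Predictor names as seen by the fitted forest
print(rf_xy$finalModel$xNames)
```

Because the factor is no longer expanded, the mtry grid should not exceed the number of original predictor columns (12 for the river data, which matches the 1:12 grid in the question).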