Using the "caret" package for variable encoding in k-fold cross-validation of a random forest

Question

I want to run an RF classification just as it is specified in 'randomForest', but still use repeated k-fold cross-validation (code below). How do I stop caret from creating dummy variables out of my categorical ones? I read that this may be due to one-hot encoding, but I am not sure how to change it. I would be very grateful for some example lines on how to fix this!

The data:

> str(river)
'data.frame':   121 obs. of  13 variables:
 $ stat_bino     : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ Subfamily     : Factor w/ 14 levels "carettochelyinae",..: 14 14 14 14 8 8 8 8 8 8 ...
 $ MAXCL         : num  850 850 360 540 625 600 760 480 560 580 ...
 $ CS            : num  8 8 14 15 26 25.5 20 20 18 21.5 ...
 $ CF            : num  3.5 3.5 2.5 2 1.5 3 2 2 1 1 ...
 $ size_mat      : num  300 300 170 180 450 450 460 406 433 433 ...
 $ incubat       : num  97.5 97.5 71 72.5 91.5 67.5 73 55 83 80 ...
 $ diet          : Factor w/ 5 levels "omnivore leaning carnivore",..: 1 1 1 1 2 2 2 5 4 4 ...
 $ HDI           : num  721 627 878 885 704 ...
 $ HF09M93       : num  23.19 9.96 -8.52 -5.67 27.3 ...
 $ HF09          : num  116 121 110 110 152 ...
 $ deg_reg       : num  8.64 39.37 370.95 314.8 32.99 ...
 $ protected_area: num  7.55 10.93 2.84 2.89 12.71 …

The rest:

> control <- trainControl(method='repeatedcv',
+                         number=5,repeats = 3,
+                         search='grid')

> tunegrid <- expand.grid(.mtry = (1:12))

> rf_gridsearch <- train(stat_bino ~ .,
+                        data = river,
+                        method = 'rf',
+                        metric = 'Accuracy',
+                        ntree = 600,
+                        importance = TRUE,
+                        tuneGrid = tunegrid, trControl = control)

> rf_gridsearch$finalModel[["xNames"]]
 [1] "Subfamilychelinae"              "Subfamilychelodininae"          "Subfamilychelydrinae"
 [4] "Subfamilycyclanorbinae"         "Subfamilydeirochelyinae"        "Subfamilydermatemydinae"
 [7] "Subfamilygeoemydinae"           "Subfamilykinosterninae"         "Subfamilypelomedusinae"

...you get the picture. I now have 27 predictors instead of 12.

Recommended answer

When you use the formula interface to train:

train(stat_bino ~ .,
      ...

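caret converts factors using dummy coding. This makes sense because formulas in most traditional R functions work this way (for instance lm). It also accounts for the 27 predictors in the question: the 14-level Subfamily becomes 13 columns, the 5-level diet becomes 4, and the 10 numeric variables stay as they are.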
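To see the dummy coding in isolation, here is a small base R sketch. It assumes caret's formula method expands factors the same way model.matrix does with R's default treatment contrasts; the toy data frame and its column names are invented for illustration and are not the river data.

```r
# Toy data (hypothetical): one numeric predictor and one 3-level factor
toy <- data.frame(
  size = c(300, 170, 450, 406),
  diet = factor(c("carnivore", "herbivore", "omnivore", "herbivore"))
)

# A formula expands the factor into k - 1 indicator columns;
# the first level ("carnivore") becomes the reference level
mm <- model.matrix(~ ., data = toy)
colnames(mm)
```

The factor diet no longer appears as a single column: it is replaced by one indicator column per non-reference level, named by pasting the variable name and the level together, which is the same pattern as "Subfamilychelinae" in the question's output.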

However, if you use the non-formula interface:

train(y = river$stat_bino,
      x = river[,colnames(river) != "stat_bino"],
      ...

then caret will leave the variables as they are supplied. This is what you want with tree-based methods, but it will produce errors with algorithms that cannot handle factors internally, such as glmnet.
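A minimal sketch of the complete non-formula call, assuming the caret and randomForest packages are installed. The toy data frame and its column names are invented and stand in for river; the resampling settings mirror the question, while ntree and the mtry grid are reduced only to keep the run short.

```r
library(caret)

set.seed(1)
n <- 60
# Hypothetical stand-in for the river data: a 2-level outcome,
# one numeric predictor and one factor predictor
toy <- data.frame(
  outcome = factor(sample(c("no", "yes"), n, replace = TRUE)),
  size    = rnorm(n),
  diet    = factor(sample(c("carnivore", "herbivore", "omnivore"),
                          n, replace = TRUE))
)

control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# x/y interface: predictors are passed as a data frame, so factors stay factors
fit <- train(
  x = toy[, names(toy) != "outcome"],
  y = toy$outcome,
  method = "rf",
  metric = "Accuracy",
  ntree = 100,
  tuneGrid = expand.grid(mtry = 1:2),  # at most the number of predictor columns
  trControl = control
)

# Predictor names as seen by the underlying randomForest fit
fit$finalModel$xNames
```

With this interface the upper bound of the mtry grid is the number of original columns (12 for river), not the number of dummy columns.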

