Problem description
I have a data set with NAs sprinkled generously throughout.
In addition, it has columns that need to be factors.
I am using the rfe() function from the caret package to select variables.
It seems that the functions= argument in rfe() works with lmFuncs for data with NAs but not for factor variables, while rfFuncs works for factor variables but not for NAs.
Any suggestions for dealing with this?
I tried model.matrix(), but it seems to just cause more problems.
Recommended answer
Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.
- For NAs, either omit or impute (median, knn, etc.).
- For factor features, you were on the right track with model.matrix(). It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
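To make the "deal with it up front" idea concrete, here is a minimal base-R sketch that imputes first and dummy-codes second. The toy data frame, its column names (age, group), and the helper impute_median() are all hypothetical and only for illustration. It also shows one possible reason model.matrix() caused trouble in the question: by default it drops any row that contains an NA, so the result can end up with fewer rows than the response.

```r
# Toy data (hypothetical): a numeric column with NAs plus a factor column.
dat <- data.frame(
  age   = c(23, NA, 31, 40, NA, 52),
  group = factor(c("a", "b", "a", "c", "b", "c"))
)

# With the default na.action (na.omit), model.matrix() silently drops
# the rows that contain NAs, so only 4 of the 6 rows survive here.
nrow_before_imputing <- nrow(model.matrix(~ . - 1, data = dat))

# Hypothetical helper: replace NAs in every numeric column by that
# column's median (factor columns are left untouched).
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

dat_imputed <- impute_median(dat)

# Expand the factor into dummy columns. "- 1" removes the intercept, so the
# first factor in the formula keeps one column per level.
x <- model.matrix(~ . - 1, data = dat_imputed)
```

The resulting x is an all-numeric matrix with no NAs and one row per original observation, which is the kind of input that either lmFuncs or rfFuncs should be able to accept. Median imputation is only the simplest of the options listed above; omitting rows or knn imputation would slot into the same place.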
Also, just in case you haven't (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points. http://cran.r-project.org/web/packages/caret/index.html