Problem description
I'm working with random forests for the first time and have run into a problem I can't figure out. When I run the analysis on my whole dataset (about 3000 rows), I don't get any error message. But when I run the same analysis on a subset of the dataset (about 300 rows), I get an error:
dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names (dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB+ dataset2$predictorC + dataset2$predictorD + dataset2$predictorE + dataset2$predictorF + dataset2$predictorG + dataset2$predictorH + dataset2$predictorI, data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)
# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB+ groupA$predictorC + groupA$predictorD + groupA$predictorE + groupA$predictorF + groupA$predictorG + groupA$predictorH + groupA$predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
However, my response variable doesn't have any empty classes.
If I instead call randomForest as (a+b+c, y) rather than (y ~ a+b+c), I get this other message:
Error in if (n == 0) stop("data (x) has 0 rows") :
argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB, :
+ not meaningful for factors
The second problem is that when I try to impute my data with rfImpute(), I get an error:
Errore in na.roughfix.default(x) : roughfix can only deal with numeric data
However, my columns are all either factors or numeric.
Can somebody see where I'm going wrong?
Recommended answer
Based on the discussion in the comments, here's a guess at a potential solution.
The problem here stems from the fact that the levels of a factor are an attribute of the variable. Those levels stay the same no matter what subset of the data you take, and no matter how small that subset is. This is a feature, not a bug, but it is a common source of confusion.
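A minimal sketch of this behaviour, using a small made-up data frame (the object name toy and its values are hypothetical, not the asker's data):

```r
# Toy data: 'response' is a factor with three levels.
toy <- data.frame(
  order    = c("groupA", "groupA", "groupB", "groupB"),
  response = factor(c("low", "high", "mid", "low"))
)

# Keep only the groupA rows, where "mid" never occurs ...
sub <- toy[toy$order == "groupA", ]

# ... but the factor still carries all three levels.
levels(sub$response)

# table() therefore reports a count of zero for "mid": an empty class.
table(sub$response)
```

A zero-count level like this is presumably what randomForest is complaining about when it says there are empty classes in y, even though every row in the subset has a valid response.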
If you want to drop missing levels when subsetting, wrap your subset operation in droplevels():
groupA <- droplevels(dataset2[dataset2$order=="groupA",])
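To check that this does what is intended, here is a small sketch on made-up data (the names toy, kept and dropped are hypothetical) comparing a plain subset with one passed through droplevels():

```r
# Toy data with factor columns (stringsAsFactors = TRUE mimics older R defaults).
toy <- data.frame(
  order    = c("groupA", "groupA", "groupB", "groupB"),
  response = factor(c("low", "high", "mid", "low")),
  stringsAsFactors = TRUE
)

kept    <- toy[toy$order == "groupA", ]  # plain subset
dropped <- droplevels(kept)              # subset with unused levels removed

nlevels(kept$response)     # still counts the unused "mid" level
nlevels(dropped$response)  # only the levels that actually occur

# No level of the cleaned response has a zero count any more.
any(table(dropped$response) == 0)
```

Note that droplevels() on a data frame cleans up every factor column, so unused levels in the predictors are dropped as well.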
I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside is that if you frequently share your code with other people, it may behave differently for them if they haven't changed R's default options in the same way.
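One way to sidestep the sharing problem is to set the argument on the call itself rather than globally, so the script behaves the same whatever the session options are. The sketch below uses inline made-up data in place of the CSV file (the contents of csv_text are hypothetical). Note that R 4.0.0 and later already default to stringsAsFactors = FALSE.

```r
# Inline stand-in for the CSV file; the values are invented for illustration.
csv_text <- "order;response;predictorA
groupA;yes;1.5
groupA;no;2.0
groupB;maybe;3.1"

# Ask for character columns explicitly instead of relying on session options.
dat <- read.csv(text = csv_text, sep = ";", header = TRUE,
                stringsAsFactors = FALSE)

is.character(dat$response)  # strings were not converted to factors

# Subset first, then convert: the factor only gets levels that are present.
groupA <- dat[dat$order == "groupA", ]
groupA$response <- factor(groupA$response)
levels(groupA$response)
```

Converting to a factor after subsetting means the response never picks up levels from rows that were filtered out, which is the same end result as calling droplevels().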