Problem description
I am working on a multiclass classification problem using RandomForestClassifier. The target variable Y takes only one of three values: {-1, 0, 1}. I understand that numerical encoding is necessary.
However, is it necessary to transform Y with pd.get_dummies(Y) to obtain an indicator matrix like the one below, and then feed that indicator matrix into RandomForestClassifier?
      -1.0  0.0  1.0
0        0    0    1
1        1    0    0
2        0    0    1
3        1    0    0
4        1    0    0
       ...  ...  ...
6516     1    0    0
6517     0    0    1
6518     0    0    1
6519     0    0    1
6520     1    0    0
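For reference, a matrix of this kind can be produced as in the minimal sketch below. The five sample values are made up for illustration and are not the asker's data; `dtype=int` is passed explicitly because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical toy target with the three classes from the question.
Y = pd.Series([1.0, -1.0, 1.0, -1.0, 0.0], name="Y")

# One column per distinct value of Y, with a single 1 in each row.
Y_indicator = pd.get_dummies(Y, dtype=int)

print(Y_indicator)
```

Each row of `Y_indicator` contains exactly one 1, located in the column of the class that row belongs to.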
Compared with feeding the untransformed target variable Y (i.e. a one-dimensional Series) into RandomForestClassifier, how would this affect the machine learning algorithm? Would the results be different, and why?
Does RandomForestClassifier do different things in these two scenarios? Which approach is recommended (indicator matrix vs. untransformed)?
Recommended answer
I don't think there's any reason to prefer one over the other. The documentation states that you can pass an array-like of shape (n_samples,) or (n_samples, n_outputs) as y to sklearn.ensemble.RandomForestClassifier.fit().
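As a rough illustration of both accepted shapes, the sketch below fits one forest on a 1-D target and one on its indicator matrix. The synthetic data, the variable names, and the small `n_estimators` value are assumptions made for the example, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): 200 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Labels in {-1, 0, 1}, derived from the first feature.
y_1d = np.digitize(X[:, 0], bins=[-0.5, 0.5]) - 1          # shape (200,)
y_2d = pd.get_dummies(pd.Series(y_1d), dtype=int).to_numpy()  # shape (200, 3)

clf_1d = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_1d)
clf_2d = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_2d)

# The fitted estimators record how many outputs they were trained on.
print(clf_1d.n_outputs_, clf_2d.n_outputs_)
```

Both calls to `fit()` succeed. The fitted `n_outputs_` attribute shows that the indicator matrix is handled as three separate outputs rather than as one three-class target.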
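The only difference would be how .predict() returns the predicted classes: with a one-dimensional Y it returns an array of shape (n_samples,) holding the labels -1, 0 or 1, whereas with the indicator matrix it returns an array of shape (n_samples, 3) with one column per indicator. I recommend you decide the shape of Y based on the format that you need the predictions to be in.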
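The difference in prediction format can be seen in a small sketch like the following. The data is synthetic and the names are hypothetical; only the shapes of the returned arrays are the point.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_1d = np.digitize(X[:, 0], bins=[-0.5, 0.5]) - 1          # labels in {-1, 0, 1}
y_2d = pd.get_dummies(pd.Series(y_1d), dtype=int).to_numpy()  # indicator matrix

pred_1d = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_1d).predict(X)
pred_2d = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_2d).predict(X)

print(pred_1d.shape)  # one label per sample
print(pred_2d.shape)  # one 0/1 value per sample and per indicator column
```

With the indicator matrix, each column is predicted as its own 0/1 output, so it is worth checking that every predicted row contains exactly one 1 before mapping the rows back to class labels.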
Aside from that, the splitting process of each estimator is exactly the same.