Problem description
I am using a random forest classifier for feature selection. I have 70 features in total and want to select the most important ones. The code below fits the classifier and prints the features from most to least important.
Code:
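import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column after the first (presumably the target column)
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
clf.fit(X_train, y_train)
importances = clf.feature_importances_
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))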
Now I am trying to use SelectFromModel from sklearn.feature_selection, but how can I decide the threshold value for my dataset?
from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)
# Train the selector
sfm.fit(X_train, y_train)
When I try threshold=0.15 and then try to train my model, I get an error saying the data is too noisy or the selection is too strict.
But if I use threshold=0.015, I am able to train my model on the selected features. So how can I decide this threshold value?
Recommended answer
I would try the following approach:
- start with a low threshold, for example: 1e-4
- reduce your features using SelectFromModel fit & transform
- compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features
- increase the threshold and repeat all steps starting from point 1.
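The steps above can be sketched as follows. This is only a rough illustration, not a definitive recipe: the synthetic data from make_classification, the list of candidate thresholds, and the use of 3-fold cross-validated accuracy as the metric are all assumptions made for the example, so substitute your own X_train, y_train and metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in for the real data (70 features, 10 informative).
X_train, y_train = make_classification(
    n_samples=300, n_features=70, n_informative=10, random_state=0
)

# Assumption: an arbitrary, increasing grid of candidate thresholds.
thresholds = [1e-4, 0.005, 0.01, 0.015, 0.02, 0.03]

results = []  # (threshold, number of selected features, mean CV accuracy)
for threshold in thresholds:
    # Fit the selector; it trains its own forest and keeps the features
    # whose importance is at least `threshold`.
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=threshold,
    )
    sfm.fit(X_train, y_train)

    n_selected = int(sfm.get_support().sum())
    if n_selected == 0:
        # Selection is too strict; larger thresholds will be empty as well.
        break

    # Score a fresh forest on the reduced feature set.
    X_selected = sfm.transform(X_train)
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X_selected, y_train, cv=3, scoring="accuracy",
    ).mean()
    results.append((threshold, n_selected, score))
    print("threshold=%.4f  features=%2d  accuracy=%.3f" % (threshold, n_selected, score))

# Pick the threshold with the best score among those tried.
best = max(results, key=lambda r: r[2])
print("best threshold:", best[0])
```

Stopping as soon as no feature survives avoids the "too noisy or too strict" failure, since every later (larger) threshold would also select nothing.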
Using this approach you can estimate the best threshold for your particular data and estimator.
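It may also help to look at the scale of the importances before choosing thresholds to try. A random forest's feature_importances_ are normalized to sum to 1, so with 70 features the average importance is about 1/70 ≈ 0.014, which is likely why 0.15 selects nothing while 0.015 works. The sketch below checks this on synthetic data, which is an assumption standing in for your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic stand-in for the real data (70 features).
X, y = make_classification(
    n_samples=300, n_features=70, n_informative=10, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = clf.feature_importances_

total = importances.sum()
print("sum of importances:", total)
print("average importance (1 / n_features):", 1.0 / X.shape[1])
print("largest importance:", importances.max())
print("features above 0.15:", int((importances > 0.15).sum()))
print("features above 0.015:", int((importances > 0.015).sum()))
```

SelectFromModel also accepts string thresholds such as "mean", "median", or a scaled form like "1.25*mean", which can be a convenient starting point because they adapt to the number of features.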