Question
I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).
The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).
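(For context: scikit-learn's built-in importances are the impurity-based ones exposed on a fitted forest, which likewise do not account for class imbalance. A minimal sketch, with illustrative variable names:)

from sklearn.ensemble import RandomForestClassifier

# Mean-decrease-in-impurity (Gini) importances: computed on the training
# set, and they do not take the class distribution into account
rf = RandomForestClassifier(n_estimators=500).fit(X_train, Y_train)
print(rf.feature_importances_)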
My question is: is this kind of method implemented in scikit-learn (as it is in the R package party)? Or is there a workaround?
PS: This question is related to .
Answer
After doing some research, this is what I came up with:
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Feature names: the first column of db_train is the target, the rest are features
names = db_train.iloc[:, 1:].columns.tolist()

# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                  class_weight="balanced",  # was "auto" in older scikit-learn
                                  criterion='gini',
                                  bootstrap=True,
                                  max_features=10,
                                  min_samples_split=2,  # was 1; recent versions require >= 2
                                  min_samples_leaf=6,
                                  max_depth=3,
                                  n_jobs=-1)

scores = defaultdict(list)

# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
acc = roc_auc_score(Y_test, rf.predict(X_test))

# Permutation importance: shuffle one feature at a time (X_train/X_test
# are assumed to be NumPy arrays) and record the relative drop in AUC
for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])
    shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
    scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat)
              for feat, score in scores.items()], reverse=True))
Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
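One refinement worth noting (my suggestion, not part of the original answer): roc_auc_score is computed above on hard 0/1 predictions, which discards the ranking information AUC is meant to measure. Scoring the positive-class probabilities instead is usually more sensitive:

# Suggested tweak: score with probabilities so the AUC reflects the full
# ranking of the test points rather than a hard threshold at 0.5
acc = roc_auc_score(Y_test, rf.predict_proba(X_test)[:, 1])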
The output is not very sexy, but you get the idea. The weakness of this approach is that the feature importances appear to be very parameter-dependent: I ran it with different parameters (max_depth, max_features, ...) and got quite different results. So I decided to run a grid search over the parameters (with scoring='roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
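A minimal sketch of that grid search, assuming the same X_train/Y_train as above; the parameter grid below is a hypothetical placeholder, not the grid actually used:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hypothetical parameter grid -- placeholder values for illustration only
param_grid = {
    'max_depth': [3, 5, 10],
    'max_features': [5, 10, 'sqrt'],
    'min_samples_leaf': [1, 6, 20],
}

grid = GridSearchCV(RandomForestClassifier(n_estimators=500,
                                           class_weight="balanced",
                                           n_jobs=-1),
                    param_grid,
                    scoring='roc_auc',  # optimise AUC rather than accuracy
                    cv=5)
grid.fit(X_train, Y_train)

# The permutation loop above can then be rerun on grid.best_estimator_
best_rf = grid.best_estimator_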
I took my inspiration from this (great) notebook.
All suggestions/comments are most welcome!
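As a direct answer to the original question: recent scikit-learn versions (0.22 and later) do ship this kind of measure out of the box as sklearn.inspection.permutation_importance, which accepts scoring='roc_auc' and replaces the manual loop above. A minimal sketch, assuming a fitted classifier rf and the same test split:

from sklearn.inspection import permutation_importance

# Permute each feature n_repeats times on the held-out data and
# measure the mean drop in ROC AUC
result = permutation_importance(rf, X_test, Y_test,
                                scoring='roc_auc',
                                n_repeats=10,
                                random_state=0,
                                n_jobs=-1)

for i in result.importances_mean.argsort()[::-1]:
    print(names[i], round(result.importances_mean[i], 4))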