Question

I am receiving different ROC-AUC scores from sklearn's RandomForestClassifier and the roc_curve/auc methods, respectively.
The following code got me an ROC-AUC (i.e. gs.best_score_) of 0.878:
def train_model(mod=None, params=None, features=None,
                outcome=...outcomes array..., metric='roc_auc'):
    gs = GridSearchCV(mod, params, scoring=metric, loss_func=None, score_func=None,
                      fit_params=None, n_jobs=-1, iid=True, refit=True, cv=10,
                      verbose=0, pre_dispatch='2*n_jobs', error_score='raise')
    gs.fit(...feature set df..., outcome)
    print gs.best_score_
    print gs.best_params_
    return gs

model = RandomForestClassifier(random_state=2000, n_jobs=-1)
features_to_include = [...list of column names...]
parameters = {
    'n_estimators': [...list...], 'max_depth': [...list...],
    'min_samples_split': [...list...], 'min_samples_leaf': [...list...]
}
gs = train_model(mod=model, params=parameters, features=features_to_include)
Whereas the following code got me an ROC-AUC of 0.97:
fpr = dict()
tpr = dict()
roc_auc = dict()
fpr['micro'], tpr['micro'], _ = roc_curve(...outcomes array...,
                                          gs.predict_proba(...feature set df...)[:, 1])
roc_auc['micro'] = auc(fpr['micro'], tpr['micro'])
Why is there such a difference? Did I do something wrong in my code?

Thanks! Chris
Answer

They return different values, for two reasons:
The GridSearchCV method splits your data into 10 folds (you are doing 10-fold cross-validation in your code), trains on 9 of them, and computes the AUC on the held-out fold; best_score_ is the cross-validated score of the best parameter setting (more info here). Your roc_curve calculation, by contrast, reports the AUC on the entire set, using predictions from a model that was already fit on that same data, so it is optimistically biased.
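To see the size of that bias, here is a minimal sketch on a synthetic dataset. The dataset, fold count, and hyperparameters below are illustrative assumptions, not the asker's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validated AUC: each fold is scored on data the model never saw during fitting.
cv_auc = cross_val_score(clf, X, y, cv=10, scoring='roc_auc').mean()

# Resubstitution AUC: score the model on the same data it was fit on.
clf.fit(X, y)
resub_auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])

# A random forest nearly memorizes its training data, so resub_auc
# comes out much higher than the honest cross-validated estimate.
print(cv_auc, resub_auc)
```

The gap between these two numbers is the same kind of gap as between your 0.878 and 0.97.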
The default cross-validated roc_auc is the macro version (see here), but your later computation computes the micro version.
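The macro/micro distinction only matters once there is more than one label column: macro averages the per-class AUCs, while micro pools every (label, score) pair before computing a single AUC. A toy multilabel example (the arrays are hypothetical, not the asker's data) shows the two diverging:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two-column multilabel indicator targets and predicted scores.
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
y_score = np.array([[0.6, 0.1],
                    [0.4, 0.3],
                    [0.7, 0.05],
                    [0.3, 0.2]])

# Macro: compute AUC per class, then average. Each class ranks its own
# positives above its own negatives perfectly here, so macro = 1.0.
macro = roc_auc_score(y_true, y_score, average='macro')

# Micro: flatten all labels and scores into one pool first. Class 1's
# scores live on a lower scale, so pooled ranking is imperfect.
micro = roc_auc_score(y_true, y_score, average='micro')

print(macro, micro)  # macro is 1.0; micro is lower
```

So even on identical predictions, the two averaging schemes can legitimately disagree.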