Problem description
I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average f-measure and roc_auc.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# input_path, input_file and features are defined elsewhere
df = pd.read_csv(input_path + input_file)
X = df[features]
y = df["gold_standard"]  # 1-D target; a one-column DataFrame triggers a shape warning
clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc')
scores = cross_validate(clf, X, y, cv=k_fold, scoring=scoring)
# Print the mean of each metric over the 10 folds
for name in scoring:
    print(name)
    print(np.mean(scores['test_' + name]))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806, 0.6754, 0.6806, 0.6643, 0.7233
So, we can see that feature setting 1 gives better results for 'accuracy', 'precision_weighted', 'recall_weighted' and 'f1_weighted' than feature setting 2.
However, when it comes to 'roc_auc', feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.
On one hand, I suspect that this happens because I am using weighted scores for precision, recall and f-measure but not for roc_auc. Is it possible to compute a weighted roc_auc for binary classification in sklearn?
What is the real cause of these weird roc_auc results?
I am happy to provide more details if needed.
Recommended answer
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
- Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
- To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p > 0.5 then class = 1, else class = 0.
- Metrics like accuracy, precision, recall, and f1-score are calculated over the hard class predictions 0/1, i.e. after the threshold has been applied.
- In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
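To make the points above concrete, here is a small hand-made sketch in plain Python. The labels and the two score vectors are invented for illustration, and the helpers `accuracy_at` and `pairwise_auc` are hypothetical names, not sklearn functions. The AUC is computed from its rank interpretation (the fraction of positive/negative pairs in which the positive gets the higher score, ties counting half), which should agree with `sklearn.metrics.roc_auc_score` on the same inputs.

```python
def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy of the hard 0/1 predictions obtained by thresholding the scores."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 4 negatives followed by 4 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
# Model A: well placed around 0.5, but one negative and one positive are badly ranked.
scores_a = [0.10, 0.20, 0.30, 0.90, 0.60, 0.70, 0.80, 0.40]
# Model B: ranks every positive above every negative, but all scores are below 0.5.
scores_b = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

acc_a = accuracy_at(y_true, scores_a)                # 0.75 at the default threshold
acc_b = accuracy_at(y_true, scores_b)                # 0.5: everything is predicted as class 0
auc_a = pairwise_auc(y_true, scores_a)               # 0.75
auc_b = pairwise_auc(y_true, scores_b)               # 1.0: perfect ranking
acc_b_tuned = accuracy_at(y_true, scores_b, 0.275)   # 1.0 once the threshold is moved

print("A: accuracy@0.5 =", acc_a, " AUC =", auc_a)
print("B: accuracy@0.5 =", acc_b, " AUC =", auc_b)
print("B: accuracy@0.275 =", acc_b_tuned)
```

Model A wins on accuracy at the default threshold, while model B wins on AUC; neither result contradicts the other, because they answer different questions.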
So, it can certainly happen, and it can indeed lead to confusion among new practitioners.
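If you want to see this effect with a setup closer to the one in the question, one option is to collect out-of-fold probabilities and evaluate them at several thresholds. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` in place of the real CSV, 5 folds instead of 10 to keep it quick, and an arbitrary list of thresholds. Note also that it computes a single AUC over the pooled out-of-fold predictions, which is not identical to the mean of per-fold AUCs that `cross_validate` reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the real data (assumption: roughly a 70/30 class split).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.7, 0.3], random_state=0
)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predicted probability of class 1 for every sample.
proba = cross_val_predict(clf, X, y, cv=k_fold, method="predict_proba")[:, 1]

# AUC uses the probabilities directly; no threshold is involved.
auc = roc_auc_score(y, proba)
print("roc_auc (pooled out-of-fold):", round(auc, 4))

# Accuracy needs hard predictions, so it changes as the threshold moves.
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
accuracies = {t: accuracy_score(y, (proba > t).astype(int)) for t in thresholds}
for t, acc in accuracies.items():
    print(f"threshold={t:.1f}  accuracy={acc:.4f}")
```

The single AUC value stays fixed while the accuracy column varies with the threshold, which is why a feature set can come out ahead on one metric and behind on the other.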
The second part of my answer in this similar question might be helpful for more details. Quoting: