Question
I have the following evaluation metrics on the test set, after running 6 models for a binary classification problem:
model  accuracy  logloss  AUC
1      19%       0.45     0.54
2      67%       0.62     0.67
3      66%       0.63     0.68
4      67%       0.62     0.66
5      63%       0.61     0.66
6      65%       0.68     0.42
I have the following questions:

- How can model 1 be the best in terms of logloss (the logloss is the closest to 0), since it performs the worst in terms of accuracy? What does that mean?
- How come model 6 has a lower AUC score than e.g. model 5, when model 6 has better accuracy? What does that mean?
- Is there a way to say which of these 6 models is the best?
Answer
Very briefly, with links (as parts of this have already been discussed elsewhere)...
Although loss is a proxy for accuracy (or vice versa), it is not a very reliable one in that respect. A closer look at the specific mechanics between accuracy and loss may be useful here; consider the following SO threads (disclaimer: the answers are mine):
- Loss & accuracy - Are these reasonable learning curves?
- How does Keras evaluate the accuracy? (despite the title, it is a general exposition, and not confined to Keras in particular)
In more detail:
Assuming a sample with true label y=1, a probabilistic prediction from the classifier of p=0.51, and a decision threshold of 0.5 (i.e. for p > 0.5 we classify as 1, otherwise as 0), the contribution of this sample to the accuracy is 1/n (i.e. positive), while the loss is
-log(p) = -log(0.51) = 0.6733446
Now, assume another sample, again with true y=1, but now with a probabilistic prediction of p=0.99; the contribution to the accuracy will be the same, while the loss now will be:
-log(p) = -log(0.99) = 0.01005034
So, for two samples that are both correctly classified (i.e. they contribute positively to the accuracy by the exact same quantity), we have a rather huge difference in the corresponding losses...
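To make the difference concrete, here is a minimal Python sketch (plain math, no ML libraries) reproducing the two per-sample contributions above:

```python
import math

# Two samples, both with true label y = 1, under a 0.5 decision threshold:
# identical contribution to accuracy, very different contribution to the loss.
threshold = 0.5
for p in (0.51, 0.99):
    predicted_label = 1 if p > threshold else 0
    correct = (predicted_label == 1)   # the true label is 1
    loss = -math.log(p)                # per-sample log loss for a y=1 sample
    print(f"p={p}: correct={correct}, loss contribution={loss:.7f}")

# p=0.51: correct=True, loss contribution=0.6733446
# p=0.99: correct=True, loss contribution=0.0100503
```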
Although what you present here seems rather extreme, it shouldn't be difficult to imagine a situation where many samples of y=1 will be around the area of p=0.49, hence giving a relatively low loss but a zero contribution to the accuracy nonetheless...
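A quick simulation of that situation (a sketch assuming scikit-learn is available; the numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Extreme hypothetical case: 1000 positive samples, all predicted just
# below the 0.5 threshold. The loss stays moderate; the accuracy is 0.
y_true = np.ones(1000, dtype=int)
p = np.full(1000, 0.49)

print(accuracy_score(y_true, (p > 0.5).astype(int)))  # 0.0
print(log_loss(y_true, p, labels=[0, 1]))             # ~0.713, i.e. -log(0.49)
```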
This one is easier.
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is treating it just like any other the-higher-the-better metric, such as accuracy, which may naturally lead to puzzles like the one you express here.
The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds. So, the AUC does not actually measure the performance of a particular deployed model (which includes the chosen decision threshold), but the averaged performance of a family of models across all thresholds (the vast majority of which are of course of no interest to you, as they will never be used).
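A toy illustration of this (the probabilities below are invented for the demonstration, not taken from your models): a classifier can be better at the fixed 0.5 threshold yet rank the samples worse, so accuracy and AUC can move in opposite directions, exactly as with your models 5 and 6.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Model A: decent at the 0.5 threshold, but imperfect ranking of the classes.
p_a = np.array([0.40, 0.60, 0.55, 0.70])
# Model B: all predictions below 0.5, but it ranks the classes perfectly.
p_b = np.array([0.10, 0.20, 0.30, 0.40])

for name, p in (("A", p_a), ("B", p_b)):
    acc = accuracy_score(y_true, (p > 0.5).astype(int))
    auc = roc_auc_score(y_true, p)
    print(f"model {name}: accuracy={acc:.2f}, AUC={auc:.2f}")

# model A: accuracy=0.75, AUC=0.75
# model B: accuracy=0.50, AUC=1.00
```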
For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:
[...]
One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system
Emphasis mine - see also On the dangers of AUC...
Simple advice: do not use it.
Depends on the exact definition of "best"; if "best" means best for my own business problem that I am trying to solve (not an irrational definition for an ML practitioner), then it is the one that performs best according to the business metric appropriate for your problem, as defined by yourself. This can never be the AUC, and normally it is also not the loss...
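As a sketch of what such a business metric could look like (the 10:1 cost ratio below is an invented assumption; your actual costs will differ), one common pattern is an asymmetric misclassification cost:

```python
import numpy as np

# Hypothetical business metric: a false negative costs 10x a false positive.
# The "best" model is then the one minimizing this cost on a validation set,
# possibly with a tuned threshold - not the one with the best loss or AUC.
COST_FP, COST_FN = 1.0, 10.0

def business_cost(y_true, p, threshold=0.5):
    y_pred = (p > threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    return COST_FP * fp + COST_FN * fn

y_val = np.array([1, 0, 1, 1, 0])
p_model = np.array([0.2, 0.6, 0.9, 0.4, 0.1])
print(business_cost(y_val, p_model))  # 21.0 (1 FP + 2 FN)
```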