Problem description
Here's a brief description of my problem:
- I am working on a supervised learning task to train a binary classifier.
- My dataset has a large class imbalance: 8 negative instances for every positive one.
- I use the F-measure, i.e. the harmonic mean of precision and recall (sensitivity), to assess the performance of a classifier.
I plotted the ROC curves of several classifiers and all of them show a great AUC, which suggests that the classification is good. However, when I test a classifier and compute the F-measure, I get a really low value. I know that this issue is caused by the class skew of the dataset and, so far, I have found two options to deal with it:
- Adopting a cost-sensitive approach by assigning weights to the dataset's instances (see this post).
- Thresholding the predicted probabilities returned by the classifiers, to reduce the number of false positives and false negatives.
I went for the first option and that solved my issue (the F-measure is now satisfactory). But now my question is: which of these methods is preferable, and what are the differences?
P.S.: I am using Python with the scikit-learn library.
Recommended answer
Both weighting and thresholding are valid forms of cost-sensitive learning. In the briefest terms, you can think of the two as follows:
Weighting
Essentially, you are asserting that the 'cost' of misclassifying the rare class is higher than that of misclassifying the common class. This is applied at the algorithmic level in algorithms such as SVM, ANN, and Random Forest. The limitation is whether the algorithm can deal with weights at all. Furthermore, many applications of weighting address cases where one kind of misclassification is more serious than the other (e.g. classifying someone who has pancreatic cancer as not having cancer). In such circumstances, you know why you want to classify a specific class correctly even in an imbalanced setting. Ideally, you should optimize the cost parameters as you would any other model parameter.
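As a rough illustration of tuning the cost like any other parameter, here is a minimal sketch using scikit-learn's `class_weight` option inside a grid search. The synthetic 8:1 dataset, the choice of `LogisticRegression`, and the candidate weights are all assumptions made for the example, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with roughly 8 negatives per positive (illustrative only).
X, y = make_classification(
    n_samples=4500, n_features=10, weights=[8 / 9, 1 / 9], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Candidate costs for the positive class; the negative class cost stays at 1.
candidate_weights = [{0: 1, 1: w} for w in (1, 2, 4, 8, 16)]

# Cross-validated search that picks the weighting with the best F1 score.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"class_weight": candidate_weights},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

best_weights = search.best_params_["class_weight"]
test_f1 = search.score(X_test, y_test)  # F1 of the refitted best model
print(best_weights, round(test_f1, 3))
```

Which weight wins depends on the data and on the metric passed to `scoring`, so the grid here should be read as a starting point rather than a recommendation.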
Thresholding
If the algorithm returns probabilities (or some other score), thresholding can be applied after the model has been built. Essentially, you move the classification threshold from 50-50 to an appropriate trade-off level. The threshold can typically be optimized by generating a curve of the evaluation metric (e.g. F-measure) over the candidate thresholds. The limitation here is that you are making absolute trade-offs: any change to the cutoff that improves the prediction of one class will in turn decrease the accuracy of predicting the other class. If you have exceedingly high probabilities for the majority of your common class (e.g. most above 0.85), you are more likely to have success with this method. It is also algorithm independent (provided the algorithm returns probabilities).
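The threshold search described above could look something like the following sketch, which scans the thresholds returned by `precision_recall_curve` and keeps the one with the highest F1. The synthetic data and the `LogisticRegression` model are assumptions for illustration; any classifier exposing `predict_proba` would do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 8 negatives per positive (illustrative only).
X, y = make_classification(
    n_samples=4500, n_features=10, weights=[8 / 9, 1 / 9], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# The final precision/recall pair has no matching threshold, so drop it.
p, r = precision[:-1], recall[:-1]
denom = p + r
f1_curve = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best_threshold = thresholds[np.argmax(f1_curve)]

f1_default = f1_score(y_val, (scores >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (scores >= best_threshold).astype(int))
print(round(best_threshold, 3), round(f1_default, 3), round(f1_tuned, 3))
```

Note that the threshold is chosen and scored on the same validation split here, which is optimistic; in practice you would pick it on a validation set (or by cross-validation) and report the metric on a separate test set.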
Sampling is another common option applied to imbalanced datasets to bring some balance to the class distribution. There are essentially two fundamental approaches.
Undersampling
Extract a smaller subset of the majority instances and keep all of the minority instances. This results in a smaller dataset in which the class distribution is closer to balanced; however, you discard data that may have been valuable. On the other hand, the reduction in size can be beneficial if you have a very large amount of data.
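A minimal sketch of random undersampling with plain NumPy follows. The toy arrays and the choice of a fully balanced 1:1 target ratio are assumptions for the example; a milder ratio may work better in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 800 negatives and 100 positives (8:1), purely illustrative.
X = rng.normal(size=(900, 5))
y = np.array([0] * 800 + [1] * 100)

neg_idx = np.flatnonzero(y == 0)
pos_idx = np.flatnonzero(y == 1)

# Keep every positive and an equally sized random subset of the negatives.
kept_neg = rng.choice(neg_idx, size=pos_idx.size, replace=False)
keep = np.sort(np.concatenate([kept_neg, pos_idx]))

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))  # [100 100]
```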
Oversampling
Increase the number of minority instances by replicating them. This results in a larger dataset that retains all of the original data but may introduce bias. As the dataset grows, you may also begin to affect computational performance.
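Replication can be sketched with `sklearn.utils.resample`, drawing minority rows with replacement until the classes match. The toy arrays and the 1:1 target are assumptions for the example.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy data: 800 negatives and 100 positives (8:1), purely illustrative.
X = rng.normal(size=(900, 5))
y = np.array([0] * 800 + [1] * 100)

X_neg, X_pos = X[y == 0], X[y == 1]

# Draw positives with replacement until they match the number of negatives.
X_pos_over = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_over = np.vstack([X_neg, X_pos_over])
y_over = np.concatenate(
    [np.zeros(len(X_neg), dtype=int), np.ones(len(X_pos_over), dtype=int)]
)
print(np.bincount(y_over))  # [800 800]
```

Oversampling should be applied to the training split only, after the train/test split, so that copies of the same instance do not end up on both sides of the evaluation.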
Advanced methods
There are additional, more 'sophisticated' methods that help address this potential bias. These include SMOTE, SMOTEBoost and EasyEnsemble, as referenced in this prior question regarding imbalanced datasets and CSL.
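To give a feel for the idea behind SMOTE, here is a deliberately simplified sketch that creates synthetic minority points by interpolating between a minority instance and one of its nearest minority neighbours. The function name `smote_like` and its parameters are hypothetical, and this is not the reference algorithm or any library's implementation; for real use, a maintained implementation such as the one in the imbalanced-learn package is the safer choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration; assumes X_min has no duplicate rows.
    """
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)

    # Pick a random base point and one of its k true neighbours (columns 1..k).
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]

    # Place each synthetic point somewhere on the segment base -> neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


rng = np.random.default_rng(0)
X_min = rng.normal(size=(100, 5))  # toy minority class, illustrative only
synthetic = smote_like(X_min, n_new=700)
print(synthetic.shape)  # (700, 5)
```

Because the new points are interpolated rather than copied, the classifier sees more varied minority examples than with plain replication, which is the bias the more advanced methods try to reduce.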
One further note on building models with imbalanced data: keep your evaluation metric in mind. For example, metrics such as the F-measure do not take true negatives into account. Therefore, metrics such as Cohen's kappa are often recommended in imbalanced settings.
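The point about true negatives can be checked with a small sketch: two hypothetical confusion matrices that differ only in the number of true negatives give the same F1 but different kappa values. The helper `labels_from_counts` and the counts themselves are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score


def labels_from_counts(tp, fp, fn, tn):
    """Build (y_true, y_pred) vectors that realise a given confusion matrix."""
    y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
    y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)
    return y_true, y_pred


# Same TP, FP and FN; only the number of true negatives differs.
few_tn = labels_from_counts(tp=50, fp=30, fn=50, tn=100)
many_tn = labels_from_counts(tp=50, fp=30, fn=50, tn=10000)

f1_few, f1_many = f1_score(*few_tn), f1_score(*many_tn)
kappa_few, kappa_many = cohen_kappa_score(*few_tn), cohen_kappa_score(*many_tn)

print(f1_few == f1_many)       # F1 ignores the extra true negatives
print(kappa_few < kappa_many)  # kappa changes with them
```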