Cleveland Heart Disease dataset: can't describe the class

Question

I'm using the Cleveland Heart Disease dataset from UCI for classification, but I don't understand the target attribute.

The dataset description says that the values go from 0 to 4, but the attribute description says:

1: > 50% coronary disease

I'd like to know how to interpret this. Is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?

Recommended answer

If you are working with an imbalanced dataset, you should use a resampling technique to get better results. On imbalanced data, a classifier can end up "predicting" the most common class almost every time without making real use of the features.

You should try SMOTE, which synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class and computing the k nearest neighbors of that point. The synthetic points are then placed between the chosen point and its neighbors.
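The idea behind SMOTE can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the answer: the function name `smote_sample`, the toy `minority` points and the choice of `k` are all hypothetical, and a real project would more likely rely on a maintained library.

```python
import math
import random


def smote_sample(minority, k=3, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    if rng is None:
        rng = random.Random()
    # Pick a random point from the minority class.
    base = rng.choice(minority)
    # Find its k nearest neighbours among the other minority points.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # Choose one neighbour and interpolate at a random position between the two.
    neighbour = rng.choice(neighbours)
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Because each synthetic point lies on the segment between two existing minority points, it always stays inside the region already covered by the minority class.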

I also used K-fold cross-validation along with SMOTE. Cross-validation helps check that the model has picked up patterns that generalize rather than patterns specific to one train/test split.
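One way to combine resampling with K-fold cross-validation is to resample only the training part of each split, so that every test fold keeps the original class distribution. The sketch below assumes a binary target and uses plain random duplication of minority samples as a simple stand-in for SMOTE; the helper names `k_fold_indices` and `oversample_minority` and the toy `labels` are hypothetical, and the folds are not stratified.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def oversample_minority(indices, labels, seed=0):
    """Duplicate random minority samples until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate if one class is missing from this split.
        return list(indices)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(indices) + extra


labels = [0] * 16 + [1] * 4  # hypothetical imbalanced binary target
folds = k_fold_indices(len(labels), k=4)

splits = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    # Resample the training part only; the test fold is left untouched.
    balanced_train = oversample_minority(train_idx, labels, seed=i)
    splits.append((balanced_train, test_idx))
    # A model would be fit on balanced_train and scored on test_idx here.
```

Resampling before splitting would let copies of the same minority sample land in both the training and the test part, which tends to make the scores look better than they are.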

When measuring the model's performance, the accuracy metric can be misleading: it can show high accuracy even when there are many false positives. Use metrics such as F1-score and MCC (Matthews correlation coefficient).
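To make the point about accuracy concrete, here is a small sketch that computes accuracy, F1-score and MCC from confusion-matrix counts for a binary problem. The data is made up for illustration (90 negatives, 10 positives, and a "model" that always predicts the majority class), the function names are hypothetical, and both F1 and MCC are defined as 0 here when their denominator is zero, which is a convention rather than part of the original answer.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy_from_counts(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_from_counts(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc_from_counts(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up imbalanced data: 90 negatives and 10 positives.
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
acc = accuracy_from_counts(*counts)
f1 = f1_from_counts(*counts)
mcc = mcc_from_counts(*counts)
```

On this toy data the accuracy comes out high although no positive case is ever detected, while F1 and MCC are both 0, which is why they are the more informative metrics for imbalanced classes.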

Reference:

https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets


08-13 20:02