Question
I'm using the Cleveland Heart Disease dataset from UCI for classification, but I don't understand the target attribute.
The dataset description says that the values go from 0 to 4, but the attribute description says:
1: > 50% coronary disease
I'd like to know how to interpret this: is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?
Recommended answer
If you are working with an imbalanced dataset, you should use a resampling technique to get better results. On imbalanced data, a classifier tends to predict the most common class without really making use of the features.
You should try SMOTE, which synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing the k nearest neighbors of that point, and adding synthetic points between the chosen point and its neighbors.
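To make the SMOTE steps above concrete, here is a minimal pure-Python sketch. It is not the reference implementation: the function name `smote_sketch`, its parameters, and the toy `minority` data are all made up for illustration, and it assumes purely numeric features and Euclidean distance.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors.

    Hypothetical helper for illustration only; assumes numeric features.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random point from the minority class.
        i = rng.randrange(len(minority))
        base = minority[i]
        # 2. Find its k nearest minority neighbors (Euclidean distance).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # 3. Interpolate at a random position between the point and that neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority-class samples (made-up values, two features each).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

In practice you would more likely use an existing implementation, such as the `SMOTE` class in the imbalanced-learn package, rather than writing your own.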
I also used K-fold cross-validation along with SMOTE. Cross-validation helps check that the model has learned patterns that generalize, rather than ones specific to a single train/test split.
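One detail the answer does not spell out, but which is commonly recommended, is to resample only the training part of each fold so that the evaluation fold keeps its original, untouched samples. The sketch below illustrates that structure under stated assumptions: `k_fold_indices` and `oversample_minority` are hypothetical helpers, the labels are made-up binary values, and plain random duplication is used as a stand-in for SMOTE to keep the example short.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def oversample_minority(labels, train_idx, seed=0):
    """Stand-in for SMOTE: duplicate random minority indices until classes balance."""
    rng = random.Random(seed)
    pos = [j for j in train_idx if labels[j] == 1]
    neg = [j for j in train_idx if labels[j] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate if one class is absent from this training fold.
        return list(train_idx)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(train_idx) + extra

labels = [0] * 16 + [1] * 4  # hypothetical imbalanced binary target
splits = []
for train_idx, test_idx in k_fold_indices(len(labels), k=4):
    # Resample the training indices only; the test fold is left as it is.
    balanced_train = oversample_minority(labels, train_idx)
    splits.append((balanced_train, test_idx))
```

Each entry in `splits` could then be used to fit a model on `balanced_train` and score it on `test_idx`.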
When measuring the performance of the model, accuracy can be misleading: it can show a high value even when many minority-class cases are misclassified. Use metrics such as the F1-score and MCC (Matthews correlation coefficient) instead.
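A small worked example shows why accuracy misleads here. The sketch below computes accuracy, F1 and MCC from confusion-matrix counts for a hypothetical model that always predicts the majority class. The data (90 negatives, 10 positives) and the helper names are invented for illustration, and both F1 and MCC are defined as 0 when their denominator is 0, which is one common convention.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up imbalanced labels: 90 healthy (0), 10 diseased (1).
y_true = [0] * 90 + [1] * 10
# A "classifier" that always predicts the most common class.
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)   # 0.9
f1 = f1_score(*counts)    # 0.0
m = mcc(*counts)          # 0.0
```

Accuracy is 0.9 even though not a single positive case is detected, while F1 and MCC both come out as 0.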
References:
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets