Problem description
I am a bit lost on building an ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is Label. I want to predict the major class. I am trying to reproduce the following steps:
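- split the data into train/test sets
- perform CV on the training set
- apply undersampling only on a test fold
- after the model has been chosen with the help of CV, undersample the training set and train the classifier
- estimate the performance (recall) on the untouched test set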
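The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the asker's dataset: the `undersample` helper, the generated features and labels, and the choice of class 1 as the recall target are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, rng):
    """Hypothetical helper: randomly keep as many rows per class as the rarest class has."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = int(counts.min())
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


# Synthetic stand-in for the real data: 500 rows, 4 binary features,
# roughly 20% positives (label 1) and 80% negatives (label -1).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))
y = np.where((X[:, 0] == 1) & (X[:, 1] == 1) & (rng.random(500) < 0.8), 1, -1)

# Step 1: split train/test, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12)

# Steps 2-3: CV on the training set; only the held-out fold is undersampled.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_recalls = []
for fit_idx, val_idx in cv.split(X_train, y_train):
    model.fit(X_train[fit_idx], y_train[fit_idx])
    X_val, y_val = undersample(X_train[val_idx], y_train[val_idx], rng)
    fold_recalls.append(recall_score(y_val, model.predict(X_val), pos_label=1))

# Step 4: undersample the whole training set and fit the chosen model on it.
X_bal, y_bal = undersample(X_train, y_train, rng)
model.fit(X_bal, y_bal)

# Step 5: recall on the untouched test set.
test_recall = recall_score(y_test, model.predict(X_test), pos_label=1)
print("CV recalls:", fold_recalls)
print("Test recall:", test_recall)
```

Here a single model is scored for brevity; comparing several candidates would mean repeating the CV loop per candidate and picking the one with the best mean fold recall before step 4.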
What I have done is shown below:
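from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
y = df['Label']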
X = df.drop('Label',axis=1)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
X_train.shape, X_test.shape
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
acc_train_tree = accuracy_score(y_train,y_train_tree)
acc_test_tree = accuracy_score(y_test,y_test_tree)
I have some doubts about how to perform CV on the training set, apply undersampling on a test fold, and then undersample the training set and train the classifier. Are you familiar with these steps? If you are, I would appreciate your help.
If I do this:
y = df['Label']
X = df.drop('Label',axis=1)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
X_train.shape, X_test.shape
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
acc_train_tree = accuracy_score(y_train,y_train_tree)
acc_test_tree = accuracy_score(y_test,y_test_tree)
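# CV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score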
scores = cross_val_score(tree,X_train, y_train,cv = 3, scoring = "accuracy")
ypred = cross_val_predict(tree,X_train,y_train,cv = 3)
print(classification_report(y_train,ypred))
accuracy_score(y_train,ypred)
confusion_matrix(y_train,ypred)
I get this output:
              precision    recall  f1-score   support

          -1       0.73      0.99      0.84       291
           1       0.00      0.00      0.00       105

    accuracy                           0.73       396
   macro avg       0.37      0.50      0.42       396
weighted avg       0.54      0.73      0.62       396
I guess I have missed something in the code above or am doing something wrong.
Data sample:
Have_0 Have_1 Have_2 Have_letters Label
1 0 1 1 1
0 0 0 1 -1
1 1 1 1 -1
0 1 0 0 1
1 1 0 0 1
1 0 0 1 -1
1 0 0 0 1
Recommended answer
Generally, the best way to create a cross-validation set is to make it resemble your test data. In your case, if we divide the data into 3 sets (train, cross-validation, test), the best way to do it is to create sets with the same proportion of true/false labels. That is what the following function does.
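import math
import numpy as np

# df is the DataFrame from the question; Label is 1 (true) or -1 (false).
X = df[["Have_0", "Have_1", "Have_2", "Have_letters"]]
y = df["Label"]

def create_cv(X, y):
    # Accept pandas objects as well as NumPy arrays.
    if not isinstance(X, np.ndarray):
        X = X.values
        y = y.values
    test_size = 1 / 5
    is_true = y == 1     # true labels
    is_false = ~is_true  # false labels; the sample data encodes them as -1, not 0
    proportion_of_true = is_true.sum() / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = num_test_samples - num_test_true_labels
    # The first rows of each class form the test set; the rest stay in the training set.
    y_test = np.concatenate([y[is_false][:num_test_false_labels], y[is_true][:num_test_true_labels]])
    y_train = np.concatenate([y[is_false][num_test_false_labels:], y[is_true][num_test_true_labels:]])
    X_test = np.concatenate([X[is_false][:num_test_false_labels], X[is_true][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[is_false][num_test_false_labels:], X[is_true][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test

# First call holds out the test set; second call holds out the cross-validation set.
X_train, X_test, y_train, y_test = create_cv(X, y)
X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)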
By doing so we have sets with the following shapes (which all have the same proportion of true/false labels):
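The shapes and label shares of each set can be checked with a few lines. The sketch below uses scikit-learn's `train_test_split` with its `stratify` argument, which preserves the class ratio much like `create_cv` but also shuffles the rows first. The toy arrays, the `random_state` value, and the `summary` dictionary are assumptions for illustration, not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real DataFrame: 80 negatives (-1), 20 positives (1).
y = np.array([-1] * 80 + [1] * 20)
X = np.arange(200).reshape(100, 2)

# First split: hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12)

# Second split: hold out 20% of the remaining rows as the cross-validation set.
X_train, X_crossv, y_train, y_crossv = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=12)

# Collect (feature shape, share of positive labels) for each set.
summary = {
    name: (features.shape, float((labels == 1).mean()))
    for name, features, labels in [
        ("train", X_train, y_train),
        ("crossv", X_crossv, y_crossv),
        ("test", X_test, y_test),
    ]
}
for name, (shape, positive_share) in summary.items():
    print(name, shape, round(positive_share, 3))
```

With small sets the shares can only match approximately, because each class contributes a whole number of rows to every split.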