machine-learning - LogisticRegressionCV predicts labels incorrectly

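I have four continuous variables, x_1 to x_4, each rescaled to the range [0, 1] by min-max scaling of the raw data. I am using LogisticRegression() to predict the class label as "1" or "0".

What isn't working? Well, my LogisticRegression() predicts every sample as class "1".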

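from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

# One stratified 80/20 split, so both parts keep the original class ratio
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_indices, test_indices in split.split(numerical_data, y):
    x_train = numerical_data[train_indices]
    y_train = y[train_indices]
    x_test  = numerical_data[test_indices]
    y_test  = y[test_indices]
# Fit with default parameters and predict on the held-out part
reg = LogisticRegression()
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
# classification_report_without_support is a user-defined helper, not shown here
print(classification_report_without_support(y_test, y_pred))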

I have the following questions:

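Is LogisticRegression the right tool for this job? I ask because it works well on one-hot encoded data.
Can it handle continuous data? Presumably yes.
Are any of the parameters I set for LogisticRegression wrong? Can you suggest something better or cleaner?
Finally, am I doing something wrong?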

Output


              precision    recall  f1-score

           0       0.00      0.00      0.00
           1       0.90      1.00      0.95

    accuracy                           0.90
   macro avg       0.45      0.50      0.47
weighted avg       0.80      0.90      0.85

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

SMOTE + same settings for LogisticRegressionCV

              precision    recall  f1-score

           0       0.63      0.73      0.67
           1       0.68      0.57      0.62

    accuracy                           0.65
   macro avg       0.65      0.65      0.65
weighted avg       0.65      0.65      0.65

SMOTE code using LogisticRegression.

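import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

os = SMOTE(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(numerical_data, y, test_size=0.2, random_state=0)

# Oversample the training part; fit_resample is the current name of the older fit_sample
os_data_x, os_data_y = os.fit_resample(x_train, y_train)
os_data_X = pd.DataFrame(data=os_data_x, columns=['x1', 'x2', 'x3', 'x4'])
os_data_Y = pd.DataFrame(data=os_data_y, columns=['y'])

# Split the oversampled data again (this overwrites the earlier x_test and y_test)
x_train, x_test, y_train, y_test = train_test_split(os_data_X, os_data_Y.values.ravel(), test_size=0.2, random_state=0)

reg.fit(x_train, y_train)

y_pred = reg.predict(x_test)
print(classification_report_without_support(y_test, y_pred))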

Accuracy of classifier on test set: 0.71

              precision    recall  f1-score

           0       0.14      0.70      0.24
           1       0.95      0.57      0.71

    accuracy                           0.58
   macro avg       0.55      0.63      0.47
weighted avg       0.87      0.58      0.67

Best answer

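Your data appears to be imbalanced: the precision/recall table shows that class 1 makes up close to 90% of the data. A model that always predicts class 1 therefore already reaches about 90% accuracy, which is exactly what your first report shows. There are several ways to handle class imbalance; you can refer to this blog for detailed solutions.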
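To make the imbalance concrete, here is a minimal sketch that checks the class shares and compares against a majority-class baseline. The labels are made up to mirror the roughly 90/10 split in the report; they are not the asker's data, and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with the roughly 90/10 split described above.
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((len(y), 4))  # placeholder features; this baseline ignores them

# Fraction of samples in each class (index 0 is class 0, index 1 is class 1).
class_share = np.bincount(y) / len(y)
print("class shares:", class_share)

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print("majority-class accuracy:", baseline_accuracy)
```

If a trained model's accuracy is no higher than this baseline, accuracy alone says little; the per-class recall is the more useful number here.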

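A quick fix is to add class weights to the model. The class_weight parameter is currently left at its default, None, in your code. Setting it means the model is penalized more for misclassifying a class 0 sample than for misclassifying a class 1 sample. As a first step, change class_weight from None to 'balanced' and see how it performs.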
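A rough sketch of that change, assuming a synthetic stand-in dataset from make_classification (four features, class 1 at about 90%), since the original data is not available. It also prints the weights that 'balanced' resolves to, which scikit-learn computes as n_samples / (n_classes * count of each class).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for the real data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=2000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Weights that class_weight="balanced" would use: the rare class gets the larger one.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))

# Same model, with and without class weighting.
default_model = LogisticRegression().fit(X_train, y_train)
balanced_model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Recall on the minority class (0) is the number that was stuck at 0.00.
recall_0_default = recall_score(y_test, default_model.predict(X_test), pos_label=0)
recall_0_balanced = recall_score(y_test, balanced_model.predict(X_test), pos_label=0)
print("recall(0), default :", recall_0_default)
print("recall(0), balanced:", recall_0_balanced)
```

LogisticRegressionCV accepts the same class_weight argument, so the change carries over if you use the cross-validated estimator.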

Keep in mind, though, that raising the weight on class 0 also affects performance on class 1. This is a trade-off you will need to weigh.
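One way to see the trade-off is to sweep the weight on class 0 and watch the recall of both classes. This is only an illustrative sketch on synthetic data; the weight values (1, 3, 9, 27) are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=2000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

results = {}
for w0 in (1, 3, 9, 27):
    # Errors on class 0 cost w0 times as much as errors on class 1.
    model = LogisticRegression(class_weight={0: w0, 1: 1}).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    recall_0 = recall_score(y_test, y_pred, pos_label=0)
    recall_1 = recall_score(y_test, y_pred, pos_label=1)
    results[w0] = (recall_0, recall_1)
    print(f"weight on class 0 = {w0:>2}: recall(0) = {recall_0:.2f}, recall(1) = {recall_1:.2f}")
```

Which point on that curve is acceptable depends on how costly each kind of error is for your problem.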

Hope this helps!

Regarding machine-learning - LogisticRegressionCV predicts labels incorrectly, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59504289/