
Problem description


I am using scikit-learn's LogisticRegression object for regularized binary classification. I've read the documentation on intercept_scaling but I don't understand how to choose this value intelligently.

The dataset is as follows:

  • 10-20 features, 300-500 replicates
  • Highly non-Gaussian, in fact most observations are zeros
  • The output classes are not necessarily equally likely. In some cases they are almost 50/50, in other cases they are more like 90/10.
  • Typically C=0.001 gives good cross-validated results.


The documentation contains warnings that the intercept itself is subject to regularization, like every other feature, and that intercept_scaling can be used to address this. But how should I choose this value? One simple answer is to explore many possible combinations of C and intercept_scaling and choose the parameters that give the best performance. But this parameter search will take quite a while and I'd like to avoid that if possible.
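The exhaustive search mentioned above can be sketched with `GridSearchCV`. This is a minimal illustration on synthetic data, not the asker's dataset; the parameter grids are assumptions chosen for demonstration. Note that `intercept_scaling` only has an effect with the `liblinear` solver, which synthesizes the intercept as an extra feature column.

```python
# Sketch of a joint grid search over C and intercept_scaling
# (synthetic data; grids are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data roughly matching the question's description.
X, y = make_classification(n_samples=400, n_features=15,
                           weights=[0.9, 0.1], random_state=0)

param_grid = {
    "C": [1e-3, 1e-2, 1e-1, 1.0],
    "intercept_scaling": [1.0, 10.0, 100.0],
}
# liblinear is the solver for which intercept_scaling matters.
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

This is exactly the parameter search the question hopes to avoid; it is shown only to make the alternative concrete.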


Ideally, I would like to use the intercept to control the distribution of output predictions. That is, I would like to ensure that the probability that the classifier predicts "class 1" on the training set is equal to the proportion of "class 1" data in the training set. I know that this is the case under certain circumstances, but this is not the case in my data. I don't know if it's due to the regularization or to the non-Gaussian nature of the input data.
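The mismatch described here is easy to check directly: compare the fraction of training points the fitted model predicts as class 1 against the actual class-1 proportion. A minimal sketch on synthetic imbalanced data (assumed, not the asker's data); with strong regularization such as `C=0.001`, the two rates often diverge, as the question reports.

```python
# Compare the predicted class-1 rate with the true class-1 proportion
# on the training set (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=15,
                           weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(C=0.001, solver="liblinear").fit(X, y)

predicted_rate = clf.predict(X).mean()  # fraction predicted as class 1
actual_rate = y.mean()                  # true class-1 proportion
print(predicted_rate, actual_rate)
```

Under heavy regularization the intercept is shrunk toward zero along with the weights, so the predicted rate frequently collapses toward the majority class.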

Thanks for any advice!

Answer


Have you tried oversampling the positive class by setting class_weight="auto"? That effectively oversamples the underrepresented classes and undersamples the majority class.
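A minimal sketch of the answer's suggestion. Note that in later scikit-learn releases the `"auto"` option was renamed `"balanced"` (weights inversely proportional to class frequencies); the code below uses the modern spelling.

```python
# class_weight reweights each class inversely to its frequency,
# counteracting the imbalance the question describes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=15,
                           weights=[0.9, 0.1], random_state=0)

# "auto" in the original answer; renamed "balanced" in newer versions.
clf = LogisticRegression(C=0.001, class_weight="balanced",
                         solver="liblinear").fit(X, y)
print(clf.predict(X).mean())  # fraction now predicted as class 1
```

With balanced class weights, the model is pushed to predict the minority class at a rate much closer to a coin-flip-adjusted share rather than ignoring it entirely.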


(The current stable docs are a bit confusing since they seem to have been copy-pasted from SVC and not edited for LR; that was just changed in the bleeding-edge version.)

