Problem description
I'm solving a classification problem with sklearn's logistic regression in Python.
My problem is a general one. I have a dataset with two classes/outcomes (positive/negative or 1/0), but the set is highly unbalanced: there are ~5% positives and ~95% negatives.
I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly using the sklearn package.
What I've done thus far is to build a balanced training set by selecting the entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model on this set, but I'm stuck on how to modify the model to then work on the original unbalanced population/set.
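In code, that subsampling step could look roughly like the sketch below (the `X` and `y` here are a synthetic stand-in for the real dataset):

```python
import numpy as np

rng = np.random.RandomState(42)
# Toy stand-in for the real data: 1000 samples, ~5% positives.
X = rng.randn(1000, 4)
y = (rng.rand(1000) < 0.05).astype(int)

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

# Keep every positive and an equal number of randomly chosen negatives.
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])
X_bal, y_bal = X[keep], y[keep]
```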
What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.
Have you tried passing class_weight="auto" to your classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
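For LogisticRegression in particular, a minimal sketch is below. Note that recent scikit-learn releases spell the option class_weight="balanced" ("auto" was the older name for the same behaviour), and the `X` and `y` are synthetic placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = (rng.rand(1000) < 0.05).astype(int)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the model can be trained on the full unbalanced set directly.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)

# You can also pass explicit weights, e.g. ~19 negatives per positive:
clf_manual = LogisticRegression(class_weight={0: 1, 1: 19})
clf_manual.fit(X, y)
```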
Also, you can rebalance your dataset by randomly dropping negative examples and/or over-sampling positive examples (potentially adding some slight Gaussian feature noise).
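A rough sketch of the over-sampling variant; the noise scale here is an arbitrary placeholder you would tune to your features, and `X` and `y` are again synthetic stand-ins:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = (rng.rand(1000) < 0.05).astype(int)

pos_idx = np.where(y == 1)[0]
n_extra = (y == 0).sum() - len(pos_idx)  # copies needed to even out the classes

# Draw positives with replacement and jitter them with slight Gaussian
# feature noise (the 0.01 scale is a placeholder to tune).
extra = rng.choice(pos_idx, size=n_extra, replace=True)
X_extra = X[extra] + rng.normal(scale=0.01, size=(n_extra, X.shape[1]))
X_res = np.vstack([X, X_extra])
y_res = np.concatenate([y, np.ones(n_extra, dtype=int)])
```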