Problem description
I have run some experiments with logistic regression in R, Python statsmodels and sklearn. While the results given by R and statsmodels agree, there is some discrepancy with what sklearn returns. I would like to understand why these results differ. I understand that the optimization algorithms used under the hood are probably not the same.
Specifically, I use the standard Default dataset (used in the ISL book). The following Python code reads the data into a dataframe Default.
import pandas as pd
# data is available here
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col=0)
#
Default['default']=Default['default'].map({'No':0, 'Yes':1})
Default['student']=Default['student'].map({'No':0, 'Yes':1})
#
I=Default['default']==0
print("Number of 'default' values :", Default[~I]['balance'].count())
Number of 'default' values : 333
There is a total of 10000 examples, with only 333 positives.
In R, I use the following:
library("ISLR")
data(Default,package='ISLR')
#write.csv(Default,"default.csv")
glm.out=glm('default~balance+income+student', family=binomial, data=Default)
s=summary(glm.out)
print(s)
#
glm.probs=predict(glm.out,type="response")
glm.probs[1:5]
glm.pred=ifelse(glm.probs>0.5,"Yes","No")
#attach(Default)
t=table(glm.pred,Default$default)
print(t)
score=mean(glm.pred==Default$default)
print(paste("score",score))
The results are as follows:
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4691  -0.1418  -0.0557  -0.0203   3.7383
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16
balance      5.737e-03  2.319e-04  24.738  < 2e-16
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8
glm.pred   No  Yes
     No  9627  228
     Yes   40  105
[1] "score 0.9732"
I am too lazy to cut and paste the results obtained with statsmodels. Suffice it to say that they are extremely similar to those given by R.
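The R summary above reports 8 Fisher scoring iterations. As a rough illustration of what that kind of solver does, here is a minimal NumPy sketch of Fisher scoring (iteratively reweighted least squares) for an unregularized logistic regression. It runs on synthetic data rather than the Default dataset, and the function name `fit_logistic_irls`, the true coefficients and the feature scales are all made up for this illustration. It is not R's actual implementation.

```python
import numpy as np


def fit_logistic_irls(X, y, max_iter=25, tol=1e-8):
    """Unregularized logistic regression via Fisher scoring (IRLS).

    X: (n_samples, n_features) array without an intercept column.
    y: (n_samples,) array of 0/1 labels.
    Returns (coefficients with the intercept first, iterations used).
    """
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(A.shape[1])
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))   # current fitted probabilities
        w = p * (1.0 - p)                       # Bernoulli variances
        score = A.T @ (y - p)                   # gradient of the log-likelihood
        fisher_info = A.T @ (A * w[:, None])    # expected information matrix
        step = np.linalg.solve(fisher_info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta, iteration


# Synthetic data with two features on very different scales (hypothetical values)
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2)) * np.array([500.0, 13000.0])
true_beta = np.array([-1.0, 0.004, -0.00005])
prob = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = (rng.random(n) < prob).astype(float)

beta_hat, n_iterations = fit_logistic_irls(X, y)
print("estimated coefficients:", beta_hat)
print("iterations used:", n_iterations)
```

Each step solves a weighted least-squares system, so the update is not sensitive to feature scaling in the way a first-order or coordinate-wise solver can be. That is one plausible reason why R and statsmodels agree on unscaled data.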
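For sklearn, I ran the following code.
- There is a parameter class_weight for taking unbalanced classes into account. I tested class_weight=None (no weighting, which I think is the default in R) and class_weight='auto' (weighting by the inverse frequencies found in the data).
- I also set C=10000 (the inverse of the regularization parameter) to minimize the effect of regularization.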
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
features = Default[['balance', 'income']]
target = Default['default']
#
# 'auto' is the option used for these runs; recent scikit-learn releases replaced it with 'balanced'
for weight in (None, 'auto'):
    print("*"*40 + "\nweight:", weight)
    # C=10000 ~ no regularization
    classifier = LogisticRegression(C=10000, class_weight=weight, random_state=42)
    classifier.fit(features, target)  # fit classifier on the whole dataset
    print("Intercept", classifier.intercept_)
    print("Coefficients", classifier.coef_)
    y_true = target
    # predict the positive class when its estimated probability exceeds 0.5
    y_pred_cls = classifier.predict_proba(features)[:, 1] > 0.5
    cm = confusion_matrix(y_true, y_pred_cls)  # rows: true class, columns: predicted class
    score = (cm[0, 0] + cm[1, 1]) / (cm[0, 0] + cm[1, 1] + cm[0, 1] + cm[1, 0])
    precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])
    recall = cm[1, 1] / (cm[1, 1] + cm[1, 0])
    print("\n Confusion matrix")
    print(cm)
    print()
    print('{s:{c}<{n}}{num:2.4}'.format(s='Score', n=15, c='', num=score))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Precision', n=15, c='', num=precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Recall', n=15, c='', num=recall))
The results are given below.
> ****************************************
>weight: None
>
>Intercept [ -1.94164126e-06]
>
>Coefficients [[ 0.00040756 -0.00012588]]
>
> Confusion matrix
>
> [[9664 3]
> [ 333 0]]
>
> Score 0.9664
> Precision 0.0
> Recall 0.0
>
> ****************************************
>weight: auto
>
>Intercept [-8.15376429]
>
>Coefficients
>[[ 5.67564834e-03 1.95253338e-05]]
>
> Confusion matrix
>
> [[8356 1311]
> [ 34 299]]
>
> Score 0.8655
> Precision 0.1857
> Recall 0.8979
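To get a feel for how strongly inverse-frequency weighting changes the problem, the sketch below computes the per-class weights for the class counts quoted earlier (9667 negatives, 333 positives). It uses the formula scikit-learn documents for class_weight='balanced', n_samples / (n_classes * np.bincount(y)). The older 'auto' option used in the runs above may have been computed slightly differently, so treat the numbers as indicative only.

```python
import numpy as np

# Labels with the same class counts as the Default dataset (values only, no features)
y = np.array([0] * 9667 + [1] * 333)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # documented 'balanced' heuristic

print("class counts :", counts)
print("class weights:", weights)
print("positive/negative weight ratio:", weights[1] / weights[0])
```

Each positive example ends up counting roughly 29 times as much as a negative one, which is consistent with the much higher recall and much lower precision seen in the weighted run.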
What I observe is that for class_weight=None, the score is excellent but no positive example is recognized: precision and recall are both zero. The coefficients found are very small, particularly the intercept. Modifying C does not change anything. For class_weight='auto' things seem better, but the precision is still very low (too many examples are classified as positive). Again, changing C does not help. If I modify the intercept by hand, I can recover the results given by R. So I suspect that there is a discrepancy between the estimates of the intercept in the two cases. Since the intercept effectively sets the decision threshold (analogous to resampling the populations), this could explain the differences in performance.
However, I would welcome any advice on choosing between the two solutions, and help in understanding the origin of these differences. Thanks.
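The remark about the intercept acting like a resampling of the populations can be made concrete with a small calculation. The sketch below assumes that class reweighting behaves like oversampling the positives, in which case the usual prior-correction argument says the weighted intercept is shifted by the log of the weight ratio. The intercept value is copied from the class_weight='auto' output above; the variable names are only illustrative. Note that the R model above also includes student, so its intercept is not expected to match exactly.

```python
import math

n_total = 10000
n_pos = 333
n_neg = n_total - n_pos

# Inverse-frequency weighting up-weights positives relative to negatives by this ratio
weight_ratio = n_neg / n_pos

# Intercept reported by sklearn with class_weight='auto' (copied from the output above)
intercept_weighted = -8.15376429

# Assumption: weighting acts like resampling, so subtract log(ratio) to undo it
intercept_corrected = intercept_weighted - math.log(weight_ratio)

# Equivalent view: a 0.5 cut-off on the weighted model corresponds to this
# cut-off on the unweighted probability scale
equivalent_threshold = 1.0 / (1.0 + weight_ratio)

print("weight ratio         :", round(weight_ratio, 3))
print("corrected intercept  :", round(intercept_corrected, 3))
print("equivalent threshold :", round(equivalent_threshold, 4))
```

Under this assumption, thresholding the weighted model at 0.5 is roughly the same as flagging every example whose unweighted default probability exceeds the base rate of about 3.3%, which would explain the large number of false positives.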
Recommended answer
I ran into a similar issue and ended up posting about it on /r/MachineLearning. It turns out the difference can be attributed to data standardization: whatever approach scikit-learn uses to find the parameters of the model yields better results when the data is standardized. scikit-learn has some documentation discussing preprocessing data (including standardization). With the two features standardized, I get the following results:
Number of 'default' values : 333
Intercept: [-6.12556565]
Coefficients: [[ 2.73145133 0.27750788]]
Confusion matrix
[[9629 38]
[ 225 108]]
Score 0.9737
Precision 0.7397
Recall 0.3243
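The coefficients printed above are on the standardized scale, so they cannot be compared directly with the raw-scale estimates from R. A minimal sketch of the conversion follows. It uses synthetic features and made-up standardized parameters (not values computed from the Default dataset), and checks that both parameterizations give the same linear predictor.

```python
import numpy as np

# Synthetic raw features on two very different scales (hypothetical values)
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[800.0, 33000.0], scale=[500.0, 13000.0], size=(1000, 2))

mu = X_raw.mean(axis=0)
sigma = X_raw.std(axis=0)          # ddof=0, the convention used by preprocessing.scale
X_std = (X_raw - mu) / sigma

# Illustrative parameters of a model fitted on standardized features (made up)
intercept_std = -6.0
coef_std = np.array([2.7, 0.3])

# Map back to the raw scale: b_raw = b_std / sigma, a_raw = a_std - sum(b_std * mu / sigma)
coef_raw = coef_std / sigma
intercept_raw = intercept_std - np.sum(coef_std * mu / sigma)

eta_std = intercept_std + X_std @ coef_std
eta_raw = intercept_raw + X_raw @ coef_raw
print("same linear predictor:", np.allclose(eta_std, eta_raw))
print("raw-scale coefficients:", coef_raw)
```

Applying the same two formulas to the fitted coefficients, with the mean and standard deviation of balance and income, should put the sklearn estimates on the scale R reports.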
Code
# scikit-learn vs. R
# http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg
import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
# Data is available here.
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)
Default['default'] = Default['default'].map({'No':0, 'Yes':1})
Default['student'] = Default['student'].map({'No':0, 'Yes':1})
I = Default['default'] == 0
print("Number of 'default' values : {0}".format(Default[~I]['balance'].count()))
feats = ['balance', 'income']
Default[feats] = preprocessing.scale(Default[feats])
# C = 1e6 ~ no regularization.
classifier = LogisticRegression(C = 1e6, random_state = 42)
classifier.fit(Default[feats], Default['default']) #fit classifier on whole base
print("Intercept: {0}".format(classifier.intercept_))
print("Coefficients: {0}".format(classifier.coef_))
y_true = Default['default']
y_pred_cls = classifier.predict_proba(Default[feats])[:,1] > 0.5
confusion = confusion_matrix(y_true, y_pred_cls)
score = float((confusion[0, 0] + confusion[1, 1])) / float((confusion[0, 0] + confusion[1, 1] + confusion[0, 1] + confusion[1, 0]))
precision = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[0, 1]))
recall = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[1, 0]))
print("\nConfusion matrix")
print(confusion)
print('\n{s:{c}<{n}}{num:2.4}'.format(s = 'Score', n = 15, c = '', num = score))
print('{s:{c}<{n}}{num:2.4}'.format(s = 'Precision', n = 15, c = '', num = precision))
print('{s:{c}<{n}}{num:2.4}'.format(s = 'Recall', n = 15, c = '', num = recall))
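With recent scikit-learn versions, the same idea is often written as a pipeline, so that the scaling parameters are learned during fit and reapplied automatically at prediction time. The sketch below is one way to do that, shown on synthetic data with made-up coefficients rather than on the Default dataset. It keeps the large-C approach from the code above to approximate an unregularized fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features and a rare positive class (hypothetical values)
rng = np.random.default_rng(42)
n = 5000
X = rng.normal(loc=[800.0, 33000.0], scale=[500.0, 13000.0], size=(n, 2))
logit = -8.0 + 0.006 * X[:, 0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardize inside the pipeline, then fit with C large enough to make
# the regularization negligible
model = make_pipeline(StandardScaler(), LogisticRegression(C=1e6, max_iter=1000))
model.fit(X, y)

n_predicted_pos = int(model.predict(X).sum())
accuracy = model.score(X, y)
print("predicted positives:", n_predicted_pos)
print("accuracy:", round(accuracy, 4))
```

The raw features are passed to both fit and predict. The pipeline takes care of scaling them consistently.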