This article explains why the AUC differs so much between logistic regression in sklearn and in R, and how to fix it.

Problem description

I use the same dataset to train a logistic regression model in both R and Python's sklearn. The dataset is imbalanced, and I find that the AUC values are quite different. This is the Python code:

from sklearn import linear_model
import sklearn.metrics

model_logistic = linear_model.LogisticRegression()  # auc 0.623
model_logistic.fit(train_x, train_y)
pred_logistic = model_logistic.predict(test_x)  # mean: 0.0235, var: 0.023
print("logistic auc: ", sklearn.metrics.roc_auc_score(test_y, pred_logistic))

This is the R code:

library(ROCR)  # for prediction() and performance()

glm_fit <- glm(label ~ watch_cnt_7 + bid_cnt_7 + vi_cnt_itm_1 +
                 ITEM_PRICE + add_to_cart_cnt_7 + offer_cnt_7 +
                 dwell_dlta_4to2 +
                 vi_cnt_itm_2 + asq_cnt_7 + watch_cnt_14to7 + dwell_dlta_6to4 +
                 auct_type + vi_cnt_itm_3 + vi_cnt_itm_7 + vi_dlta_4to2 +
                 vi_cnt_itm_4 + vi_dlta_6to4 + tenure + sum_SRCH_item_7 +
                 vi_cnt_itm_6 + dwell_itm_3 +
                 offer_cnt_14to7 +
                 dwell_itm_2 + dwell_itm_6 + CNDTN_ROLLUP_ID +
                 dwell_itm_5 + dwell_itm_4 + dwell_itm_1 +
                 bid_cnt_14to7 + item_prchsd_cnt_14to7 +
                 dwell_itm_7 + median_day_rate + vb_ratio,
               data = train, family = binomial())
p_lm <- predict(glm_fit, test[1:(nc - 1)], type = "response")
pred_lm <- prediction(p_lm, test$label)
auc <- performance(pred_lm, 'auc')@y.values

The AUC from Python is 0.623 while the AUC from R is 0.887, so I want to know what's wrong with the sklearn logistic regression and how to fix it. Thanks.

Recommended answer

In the Python script, you should use predict_proba to get the probability estimates for both classes and take the second column (the positive class) as the input to roc_auc_score, because the ROC curve is drawn by varying the probability threshold. predict returns only hard 0/1 labels, which throws away the ranking information that the AUC measures.

pred_logistic = model_logistic.predict_proba(test_x)[:, 1]  # probability of the positive class
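
For reference, here is a minimal, self-contained sketch contrasting the two scores. It uses a synthetic imbalanced dataset generated with make_classification instead of the original data, so the variable names and numbers are purely illustrative:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced dataset standing in for the original train/test split.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(train_x, train_y)

# Hard 0/1 labels: on imbalanced data most predictions are 0, so the
# ranking information the ROC curve needs is lost.
auc_labels = roc_auc_score(test_y, model.predict(test_x))

# Probability of the positive class: preserves the ranking, so the AUC
# reflects the model's discriminative power.
auc_probs = roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])

print("AUC from predict():       ", auc_labels)
print("AUC from predict_proba(): ", auc_probs)

Scoring the hard labels evaluates the ROC at a single operating point, which generally understates the AUC, while the probability column gives the full threshold sweep, matching what R's predict(..., type = "response") already returns.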
