Problem description
I am doing a Logistic Regression described in the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, Tibshirani (2013).
More specifically, I am fitting the binary classification model to the 'Wage' dataset from the R package 'ISLR' described in §7.8.1.
The predictor 'age' (expanded into a degree-4 polynomial) is fit against the binary response wage>250; the predicted probability of the 'True' class is then plotted against age.
The model in R is fit as follows:
fit=glm(I(wage>250)~poly(age,4),data=Wage, family=binomial)
agelims=range(age)
age.grid=seq(from=agelims[1],to=agelims[2])
preds=predict(fit,newdata=list(age=age.grid),se=T)
pfit=exp(preds$fit)/(1+exp(preds$fit))
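The last R line maps the log-odds returned by predict() onto the probability scale via the inverse logit. For reference, the same transform written as a small Python helper (my addition, not part of the book's lab):

```python
import numpy as np

def inv_logit(log_odds):
    """Inverse logit (logistic sigmoid): maps log-odds to a probability in (0, 1)."""
    return np.exp(log_odds) / (1 + np.exp(log_odds))

print(inv_logit(0.0))  # 0.5: log-odds of 0 is a 50/50 probability
```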
Complete code (author's site): http://www-bcf.usc.edu/~gareth/ISL/Chapter%207%20Lab.txt
The corresponding plot from the book: http://www-bcf.usc.edu/~gareth/ISL/Chapter7/7.1.pdf (right)
I tried to fit a model to the same data in scikit-learn:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

poly = PolynomialFeatures(4)
X = poly.fit_transform(df.age.values.reshape(-1, 1))  # go through .values; Series.reshape was removed
y = (df.wage > 250).astype(int).values  # .as_matrix() is deprecated
clf = LogisticRegression()
clf.fit(X, y)

# + 1 so the grid includes the maximum age, matching R's seq();
# reuse the already-fitted transformer instead of refitting it on the test grid
X_test = poly.transform(np.arange(df.age.min(), df.age.max() + 1).reshape(-1, 1))
prob = clf.predict_proba(X_test)
I then plotted the probabilities of the 'True' class against the age range, but the resulting plot looks quite different. (I am not talking about the CI bands or the rug plot, just the probability curve itself.) Am I missing something here?
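For completeness, the plotting step looks roughly like this. The sketch below uses a hypothetical probability curve standing in for prob[:, 1] (the predict_proba column for the 'True', wage > 250 class), so it runs on its own:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in for prob[:, 1] over the age grid (not the fitted values).
age_grid = np.arange(18, 81)
prob_true = 1 / (1 + np.exp(-(age_grid - 50) / 10.0))

fig, ax = plt.subplots()
ax.plot(age_grid, prob_true)   # in the real code: ax.plot(age_grid, prob[:, 1])
ax.set_xlabel("age")
ax.set_ylabel("P(wage > 250)")
fig.savefig("wage_prob.png")
```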
After some more reading I understand that scikit-learn's LogisticRegression applies L2 regularization by default, whereas R's glm fits an unregularized (maximum-likelihood) model. Statsmodels' GLM implementation in Python is also unregularized and gives results identical to R's.
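To see the effect concretely, here is a minimal sketch on synthetic data (all names and values my own, chosen only to demonstrate the shrinkage) comparing scikit-learn's default C=1.0 with a very large C, which makes the L2 penalty negligible and approximates the unpenalized fit of R's glm or statsmodels' GLM:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data from a known logistic model, kept small so the
# default penalty is visibly strong relative to the likelihood.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (rng.random(200) < 1 / (1 + np.exp(-3.0 * x[:, 0]))).astype(int)

default_fit = LogisticRegression().fit(x, y)                      # C=1.0: L2 penalty on by default
unpenalized = LogisticRegression(C=1e9, max_iter=1000).fit(x, y)  # huge C makes the penalty negligible

# The default penalty shrinks the slope toward 0, which flattens the
# predicted-probability curve relative to an unregularized fit.
print(default_fit.coef_[0, 0], unpenalized.coef_[0, 0])
```

This is why the scikit-learn curve with default settings looks different from the book's figure: raising C (or fitting with statsmodels) recovers the glm-style result.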
The R package LiblineaR is similar to scikit-learn's logistic regression (when using 'liblinear' solver).
https://cran.r-project.org/web/packages/LiblineaR/