Problem description
I'm experimenting with negative binomial regression using Python. I found this example using R, along with a data set:
http://www.karlin.mff.cuni.cz/~pesta/NMFM404/NB.html
I tried to replicate the results on the web page using this code:
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
df = pd.read_stata("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/nb_data.dta")
model = smf.glm(formula = "daysabs ~ math + prog", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
Unfortunately this didn't give the same coefficients. It gave the following:
             coef  std err       z  P>|z|  [95.0% Conf. Int.]
Intercept  3.4875    0.236  14.808  0.000     3.026    3.949
math      -0.0067    0.003  -2.600  0.009    -0.012   -0.002
prog      -0.6781    0.101  -6.683  0.000    -0.877   -0.479
These aren't even close to those on the website. Assuming the R code is correct, what am I doing wrong?
Answer
The reason for the discrepancy is that when you read the dataset in with Pandas, the prog variable is treated as type float by default:
df.prog.head()
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
Name: prog, dtype: float32
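You can confirm this directly. A quick check along these lines (a sketch, with df loaded as in the question) shows that prog comes in as a plain numeric column holding the program codes, so the model treats it as a single continuous predictor:

# Sketch: verify that prog was parsed as numeric rather than categorical
print(df["prog"].dtype)             # float32 -> treated as a continuous predictor
print(sorted(df["prog"].unique()))  # the three program codes stored as floats: 1.0, 2.0, 3.0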
In the R example, on the other hand, the prog variable was explicitly cast to a factor (categorical) variable:
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
As a result, when you look at the fit summary in R, you can see that the prog variable has been split into n-1 binary-encoded terms:
> summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
Call:
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1547 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
math -0.005993 0.002505 -2.392 0.0167 *
progAcademic -0.440760 0.182610 -2.414 0.0158 *
progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
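If it helps to see what that n-1 encoding looks like on the Python side, here is a small illustration (a sketch, not part of the original answer; it assumes df from the question is already loaded):

import pandas as pd

# Sketch: dummy-code the three program levels, dropping the first level as the
# reference -- this mirrors the progAcademic / progVocational terms in the R output.
dummies = pd.get_dummies(df["prog"].astype(int), prefix="prog", drop_first=True)
print(dummies.head())  # columns prog_2 and prog_3; level 1 ("General") is the baseline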
Compare this to how the prog variable appears in the Python fit summary you posted.
To fix the issue, you can use the C() function to cast the variable to categorical in statsmodels. This way you will arrive at the same result:
model = smf.glm(formula = "daysabs ~ math + C(prog)", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: daysabs No. Observations: 314
Model: GLM Df Residuals: 310
Model Family: NegativeBinomial Df Model: 3
Link Function: log Scale: 1.06830885199
Method: IRLS Log-Likelihood: -865.68
Date: Thu, 16 Feb 2017 Deviance: 350.98
Time: 10:34:16 Pearson chi2: 331.
No. Iterations: 6
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 2.6150 0.207 12.630 0.000 2.209 3.021
C(prog)[T.2.0] -0.4408 0.192 -2.302 0.021 -0.816 -0.065
C(prog)[T.3.0] -1.2786 0.210 -6.079 0.000 -1.691 -0.866
math -0.0060 0.003 -2.281 0.023 -0.011 -0.001
==================================================================================
"""