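I am seeing differences in the coefficient values and coefficient standard errors between the smf.ols and sm.OLS functions. Even when matched up, they should be the same regression formula and give the same results.
I have made a 100% reproducible example of my problem. The dataframe df can be downloaded from here: https://drive.google.com/drive/folders/1i67wztkrAeEZH2tv2hyOlgxG7N80V3pI?usp=sharing
Case 1: Linear model using Patsy from Statsmodels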

# First we load the libraries:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import random
import pandas as pd
# We define a specific seed to have the same results:
random.seed(1234)
# Now we read the data that can be downloaded from Google Drive link provided above:
df = pd.read_csv("/Users/user/Documents/example/cars.csv", sep = "|")
# We create the linear regression:
lm1 = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
# We see the results:
lm1.fit().summary()

The results for lm1 are:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Mon, 18 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        17:19:14   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Case 2: Linear model using dummy variables in Statsmodels
# We define a specific seed to have the same results:
random.seed(1234)
# First we check what `object` type variables we have in our dataset:
df.dtypes
# We create a list with the names of the `object` type variables
# (named cat_cols so that it does not shadow the built-in `object`):
cat_cols = ['make',
            'fuel_system',
            'engine_type',
            'num_of_doors'
            ]
# Now we convert those object variables to numeric with the get_dummies function to get a single numeric dataframe:
df_num = pd.get_dummies(df, columns = cat_cols)
# We ensure the dataframe is numeric casting all values to float64:
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
# We define the predictive variables dataset:
X = df_num.drop('price', axis = 1)
# We define the response variable values:
y = df_num.price.values
# We add a constant as we did in the previous example (adding "+1" to Patsy):
Xc = sm.add_constant(X) # Adds a constant to the model
# We create the linear model and obtain results:
lm2 = sm.OLS(y, Xc)
lm2.fit().summary()

The results for lm2 are:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Mon, 18 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        17:28:16   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.205e+04   6811.094      1.769      0.079   -1398.490    2.55e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_alfa-romero   -2273.9631   1865.185     -1.219      0.225   -5956.669    1408.743
make_audi           4245.7414   1324.140      3.206      0.002    1631.299    6860.184
make_bmw            1.199e+04   1232.635      9.730      0.000    9559.555    1.44e+04
make_chevrolet     -2845.7867   1976.730     -1.440      0.152   -6748.733    1057.160
make_dodge         -3460.3061   1170.966     -2.955      0.004   -5772.315   -1148.297
make_honda           505.6865   2049.865      0.247      0.805   -3541.661    4553.034
make_isuzu           825.0045   1706.160      0.484      0.629   -2543.716    4193.725
make_jaguar         1.525e+04   1903.813      8.010      0.000    1.15e+04     1.9e+04
make_mazda         -1967.3063    982.179     -2.003      0.047   -3906.564     -28.048
make_mercedes-benz  1.471e+04   1423.004     10.338      0.000    1.19e+04    1.75e+04
make_mercury         684.1370   2913.361      0.235      0.815   -5068.136    6436.410
make_mitsubishi    -3462.7968   1221.018     -2.836      0.005   -5873.631   -1051.963
make_nissan        -3485.5094    946.316     -3.683      0.000   -5353.958   -1617.060
make_peugot          783.0586   3513.296      0.223      0.824   -6153.754    7719.871
make_plymouth      -3168.5552   1293.376     -2.450      0.015   -5722.256    -614.854
make_porsche        7284.9115   2853.174      2.553      0.012    1651.475    1.29e+04
make_renault       -4398.9354   2037.945     -2.159      0.032   -8422.747    -375.124
make_saab           1216.5702   1487.192      0.818      0.415   -1719.810    4152.950
make_subaru        -1.863e+04   3263.524     -5.710      0.000   -2.51e+04   -1.22e+04
make_toyota        -3044.9308    776.059     -3.924      0.000   -4577.218   -1512.644
make_volkswagen    -1867.0452   1170.975     -1.594      0.113   -4179.072     444.981
make_volvo          3159.7498   1327.405      2.380      0.018     538.862    5780.638
fuel_system_1bbl   -2790.4092   2230.161     -1.251      0.213   -7193.740    1612.922
fuel_system_2bbl    -648.2498   1094.525     -0.592      0.554   -2809.330    1512.830
fuel_system_4bbl   -2326.2983   3094.703     -0.752      0.453   -8436.621    3784.024
fuel_system_idi     1.712e+04   6154.806      2.782      0.006    4971.083    2.93e+04
fuel_system_mfi      926.1109   3063.134      0.302      0.763   -5121.881    6974.102
fuel_system_mpfi    1173.7017   1186.125      0.990      0.324   -1168.238    3515.642
fuel_system_spdi     449.5911   1827.318      0.246      0.806   -3158.349    4057.531
fuel_system_spfi   -1858.2133   3111.596     -0.597      0.551   -8001.891    4285.464
engine_type_dohc    2703.6445   1803.080      1.499      0.136    -856.440    6263.729
engine_type_dohcv  -9374.0342   3504.717     -2.675      0.008   -1.63e+04   -2454.161
engine_type_l      -2130.3416   3357.283     -0.635      0.527   -8759.115    4498.431
engine_type_ohc    -1335.2404   1454.047     -0.918      0.360   -4206.177    1535.696
engine_type_ohcf    1.232e+04   2850.883      4.322      0.000    6693.659     1.8e+04
engine_type_ohcv    5755.4074   1669.627      3.447      0.001    2458.820    9051.995
engine_type_rotor   4107.6373   3032.223      1.355      0.177   -1879.323    1.01e+04
num_of_doors_four   6234.8048   3491.722      1.786      0.076    -659.410    1.31e+04
num_of_doors_two    5814.8408   3337.588      1.742      0.083    -775.045    1.24e+04
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     1.01e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.38e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

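As we can see, some variables, such as height, have the same coefficient in both outputs. Others do not (the level isuzu in make, the level ohc in engine_type, the constant, and so on). Shouldn't both outputs be the same? What am I missing or doing wrong here?
Thanks in advance for your help.
P.S. As @sukhbinder pointed out, even when I use the Patsy formula without the independent term (putting "-1" in the formula, since Patsy includes it by default) and remove the independent term from the dummy formulation, I still get different results.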

Best answer

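The results do not match because the formula interface of Statsmodels pre-selects the predictors to avoid perfect multicollinearity: for each categorical variable it keeps only n-1 dummy columns and drops one reference level, whereas pd.get_dummies keeps a column for every level.
Exactly the same results are obtained by comparing the two regression summaries, identifying the dummy columns that are missing from the formula version, and dropping them: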

deletex = [
        'make_alfa-romero',
        'fuel_system_1bbl',
        'engine_type_dohc',
        'num_of_doors_four'
        ]
df_num.drop( deletex, axis = 1, inplace = True)
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
X = df_num.drop('price', axis = 1)
y = df_num.price.values
Xc = sm.add_constant(X) # Adds a constant to the model
random.seed(1234)
linear_regression = sm.OLS(y, Xc)
linear_regression.fit().summary()

Printed result:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:16:08   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
bore                3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio  -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height               -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm              -0.5903      0.790     -0.747      0.456      -2.150       0.970
make_audi           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make_bmw            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make_chevrolet      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make_dodge         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make_honda          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make_isuzu          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make_jaguar         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make_mazda           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make_mercedes-benz  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make_mercury        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make_mitsubishi    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make_nissan        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make_peugot         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make_plymouth       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make_porsche        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make_renault       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make_saab           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make_subaru        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make_toyota         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make_volkswagen      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make_volvo          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system_2bbl    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system_4bbl     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system_idi     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system_mfi     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system_mpfi    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system_spdi    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system_spfi     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type_dohcv  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type_l      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type_ohc    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type_ohcf    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type_ohcv    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type_rotor   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors_two    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
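
The warning here reports a condition number of 3.26e+05, while the fit on all dummy columns reported 1.01e+16 and a smallest eigenvalue of 5.38e-23. That difference is what you would expect when a constant is combined with a full set of dummies: the dummy columns of one categorical variable add up to the constant column, so the design matrix loses rank. The sketch below illustrates this on a hypothetical toy column (not the cars data), using only NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy categorical column, used only for illustration.
toy = pd.DataFrame({"make": ["audi", "bmw", "volvo", "audi", "bmw", "volvo"]})

def design_rank(drop_first):
    """Return (number of columns, rank) of [constant | dummies]."""
    dummies = pd.get_dummies(toy, columns=["make"], drop_first=drop_first).astype(float)
    X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
    return X.shape[1], np.linalg.matrix_rank(X)

# Constant + one dummy per level: 4 columns but only rank 3.
cols_full, rank_full = design_rank(drop_first=False)
# Constant + n-1 dummies: 3 columns and rank 3.
cols_red, rank_red = design_rank(drop_first=True)

print(cols_full, rank_full)
print(cols_red, rank_red)
```

Because sm.OLS solves the problem with a pseudo-inverse by default, it still returns one of the many equivalent solutions for the rank-deficient matrix. This is consistent with the summaries above, where R-squared, the log-likelihood and the coefficients of the continuous variables agree while the dummy coefficients and the constant differ.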

The summary for the reduced dummy model is exactly equal to the result of the first Statsmodels call:
random.seed(1234)
lm_python = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
lm_python.fit().summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     35.54
Date:                Thu, 21 Feb 2019   Prob (F-statistic):           5.24e-62
Time:                        18:17:37   Log-Likelihood:                -1899.7
No. Observations:                 205   AIC:                             3879.
Df Residuals:                     165   BIC:                             4012.
Df Model:                          39
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              1.592e+04   1.21e+04      1.320      0.189   -7898.396    3.97e+04
make[T.audi]           6519.7045   2371.807      2.749      0.007    1836.700    1.12e+04
make[T.bmw]            1.427e+04   2292.551      6.223      0.000    9740.771    1.88e+04
make[T.chevrolet]      -571.8236   2860.026     -0.200      0.842   -6218.788    5075.141
make[T.dodge]         -1186.3430   2261.240     -0.525      0.601   -5651.039    3278.353
make[T.honda]          2779.6496   2891.626      0.961      0.338   -2929.709    8489.009
make[T.isuzu]          3098.9677   2592.645      1.195      0.234   -2020.069    8218.004
make[T.jaguar]         1.752e+04   2416.313      7.252      0.000    1.28e+04    2.23e+04
make[T.mazda]           306.6568   2134.567      0.144      0.886   -3907.929    4521.243
make[T.mercedes-benz]  1.698e+04   2320.871      7.318      0.000    1.24e+04    2.16e+04
make[T.mercury]        2958.1002   3605.739      0.820      0.413   -4161.236    1.01e+04
make[T.mitsubishi]    -1188.8337   2284.697     -0.520      0.604   -5699.844    3322.176
make[T.nissan]        -1211.5463   2073.422     -0.584      0.560   -5305.405    2882.312
make[T.peugot]         3057.0217   4255.809      0.718      0.474   -5345.841    1.15e+04
make[T.plymouth]       -894.5921   2332.746     -0.383      0.702   -5500.473    3711.289
make[T.porsche]        9558.8747   3688.038      2.592      0.010    2277.044    1.68e+04
make[T.renault]       -2124.9722   2847.536     -0.746      0.457   -7747.277    3497.333
make[T.saab]           3490.5333   2319.189      1.505      0.134   -1088.579    8069.645
make[T.subaru]        -1.636e+04   4002.796     -4.087      0.000   -2.43e+04   -8456.659
make[T.toyota]         -770.9677   1911.754     -0.403      0.687   -4545.623    3003.688
make[T.volkswagen]      406.9179   2219.714      0.183      0.855   -3975.788    4789.623
make[T.volvo]          5433.7129   2397.030      2.267      0.025     700.907    1.02e+04
fuel_system[T.2bbl]    2142.1594   2232.214      0.960      0.339   -2265.226    6549.545
fuel_system[T.4bbl]     464.1109   3999.976      0.116      0.908   -7433.624    8361.846
fuel_system[T.idi]     1.991e+04   6622.812      3.007      0.003    6837.439     3.3e+04
fuel_system[T.mfi]     3716.5201   3936.805      0.944      0.347   -4056.488    1.15e+04
fuel_system[T.mpfi]    3964.1109   2267.538      1.748      0.082    -513.019    8441.241
fuel_system[T.spdi]    3240.0003   2719.925      1.191      0.235   -2130.344    8610.344
fuel_system[T.spfi]     932.1959   4019.476      0.232      0.817   -7004.041    8868.433
engine_type[T.dohcv]  -1.208e+04   4205.826     -2.872      0.005   -2.04e+04   -3773.504
engine_type[T.l]      -4833.9860   3763.812     -1.284      0.201   -1.23e+04    2597.456
engine_type[T.ohc]    -4038.8848   1213.598     -3.328      0.001   -6435.067   -1642.702
engine_type[T.ohcf]    9618.9281   3504.600      2.745      0.007    2699.286    1.65e+04
engine_type[T.ohcv]    3051.7629   1445.185      2.112      0.036     198.323    5905.203
engine_type[T.rotor]   1403.9928   3217.402      0.436      0.663   -4948.593    7756.579
num_of_doors[T.two]    -419.9640    521.754     -0.805      0.422   -1450.139     610.211
bore                   3993.4308   1373.487      2.908      0.004    1281.556    6705.306
compression_ratio     -1200.5665    460.681     -2.606      0.010   -2110.156    -290.977
height                  -80.7141    146.219     -0.552      0.582    -369.417     207.988
peak_rpm                 -0.5903      0.790     -0.747      0.456      -2.150       0.970
==============================================================================
Omnibus:                       65.777   Durbin-Watson:                   1.217
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              399.594
Skew:                           1.059   Prob(JB):                     1.70e-87
Kurtosis:                       9.504   Cond. No.                     3.26e+05
==============================================================================

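You need to check that the predictors correspond between the two versions, because pd.get_dummies creates a dummy column for every level of each categorical variable, whereas Statsmodels uses n-1 levels for each categorical variable in the formula version.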

Regarding python - differences in linear regression with Statsmodels between the Patsy version and the dummy list version, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54751637/
