Question
I am trying to understand how R determines reference groups for interactions in a linear model. Consider the following:
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
y = c(1.4068142116718, 2.67041187927052, 2.69166439169131,
3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572,
5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286,
3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795,
5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801,
5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279,
5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312,
8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id",
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")
lm(y ~ -1 + id + year + year:treatment, df)
#Coefficients:
# id1 id2 id3 id4
# 2.6585 3.9933 4.1161 5.3544
# id5 year2 year1:treatment1 year2:treatment1
# 6.1991 0.7149 -0.6317 NA
R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NAs in the results.
Also, R is inconsistent about which groups it drops. I would like to estimate a model with the same omitted group (year1) in both the main effect and the interactions. How can I force R to omit year1 and year1:treatment1 from the preceding model?
I understand that there are several workarounds for this problem (e.g. creating all the variables by hand and writing them out in the model's formula). But the actual models I am estimating are much more complicated versions of this problem and such a workaround would be cumbersome.
Answer
There is no causality between those two. You get NAs purely because your variables are nested.
There is no inconsistency; model.matrix simply has its own rules. The contrasts only seem "inconsistent" because you do not have the main effect treatment, only the interaction treatment:year.
In the following, I will first explain the NA coefficients, then the contrasts for interactions, and finally give some suggestions.
By default, treatment contrasts are used for factors, and by default contr.treatment takes the first factor level as the reference level. Have a look at your data:
levels(df$id)
# [1] "1" "2" "3" "4" "5"
levels(df$year)
# [1] "1" "2"
levels(df$treatment)
# [1] "0" "1"
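You can inspect the contrast matrices themselves to see where the reference levels come from. A small illustration with contr.treatment (base R), for factors with two and five levels:

```r
# contr.treatment(n) maps n factor levels to n-1 dummy columns;
# the first level (the reference) gets an all-zero row.
contr.treatment(2)
#   2
# 1 0
# 2 1
contr.treatment(5)
#   2 3 4 5
# 1 0 0 0 0
# 2 1 0 0 0
# 3 0 1 0 0
# 4 0 0 1 0
# 5 0 0 0 1
```

The all-zero first row is why level "1" of id and year, and level "0" of treatment, never show up as coefficients.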
Now take a look at a simple linear model:
lm(y ~ id + year + treatment, df)
#Coefficients:
#(Intercept) id2 id3 id4 id5 year2
# 2.153 1.651 1.773 2.696 3.541 1.094
# treatment1
# NA
You can see that id1, year1 and treatment0 are not there, as they are taken as reference levels.
Even without the interaction, you already have an NA coefficient. This implies that treatment is nested in span{id, year}. When you further include treatment:year, this nesting still exists.
In fact, a further test shows that treatment is nested in id:
lm(y ~ id + treatment, df)
# Coefficients:
#(Intercept) id2 id3 id4 id5 treatment1
# 2.700 1.651 1.773 2.696 3.541 NA
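The nesting can also be confirmed without fitting anything. A quick sketch, reconstructing just the id and treatment columns from the data above: each id takes exactly one treatment value, so the treatment dummy is a linear combination of the id dummies.

```r
# Rebuild the two relevant factors from the question's data frame:
# ids 1, 4, 5 always have treatment "1"; ids 2, 3 always have "0".
id        <- factor(rep(1:5, each = 6))
treatment <- factor(rep(c(1, 0, 0, 1, 1), each = 6))

# Each row of this table has a single nonzero cell, i.e. each id
# is observed under exactly one treatment:
table(id, treatment)
#    treatment
# id  0 1
#   1 0 6
#   2 6 0
#   3 6 0
#   4 0 6
#   5 0 6
```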
In summary, the variable treatment is completely redundant for your modelling purpose. If you include id, there is no need to include treatment or treatment:*, where * can be any variable.
It is very easy to get nesting when you have many factor variables in a regression model. Note that contrasts will not necessarily remove nesting, as contrasts only recognise variable names, not the underlying numerical structure. Take the following example of how to cheat contr.treatment:
df$ID <- df$id
lm(y ~ id + ID, df)
#Coefficients:
#(Intercept) id2 id3 id4 id5 ID2
# 2.700 1.651 1.773 2.696 3.541 NA
# ID3 ID4 ID5
# NA NA NA
Look: contrasts work as expected, but ID is nested in id, so we have rank deficiency.
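You can see the rank deficiency directly on the design matrix, without fitting a model. A minimal sketch, reconstructing the id factor from the data above:

```r
# Duplicate factor, nested in (in fact identical to) id:
id <- factor(rep(1:5, each = 6))
ID <- id

# Contrasts produce 4 dummy columns per factor plus an intercept,
# but the ID columns are exact copies of the id columns:
X <- model.matrix(~ id + ID)
ncol(X)     # 9 columns (intercept + 4 + 4)
qr(X)$rank  # rank is only 5, so 4 coefficients must be NA
```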
We first remove the noise introduced by the NA by dropping the id variable. Then a regression model with treatment and year will be full rank, so no NA should be seen if the contrasts are successful.
Contrasts for an interaction or higher-order effect depend on the presence of the lower-order effects. Compare the following models:
lm(y ~ year:treatment, df) ## no low-order effects at all
#Coefficients:
# (Intercept) treatment0:year1 treatment1:year1 treatment0:year2
# 5.4523 -1.3976 -1.3466 -0.6826
#treatment1:year2
# NA
lm(y ~ year + treatment + year:treatment, df) ## with all low-order effects
#Coefficients:
# (Intercept) treatment1 year2 treatment1:year2
# 4.05471 0.05094 0.71493 0.63170
In the first model, no contrasts are applied, because there are no main effects, only the second-order effect. The NA here is due to the absence of contrasts.
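Again, this can be seen from the design matrix alone. A sketch reconstructing the two factors: without main effects, R codes all four year/treatment cells, which together with the intercept are rank-deficient by one.

```r
# Rebuild the factors from the question's data:
year      <- factor(rep(1:2, times = 15))
treatment <- factor(rep(c(1, 0, 0, 1, 1), each = 6))

# No main effects, only the interaction:
X <- model.matrix(~ year:treatment)
ncol(X)     # 5: intercept plus all four year x treatment cells
qr(X)$rank  # 4: the four cell dummies sum to the intercept column
```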
Now consider another set of examples, including some but not all main effects:
lm(y ~ year + year:treatment, df) ## main effect `treatment` is missing
#Coefficients:
# (Intercept) year2 year1:treatment1 year2:treatment1
# 4.05471 0.71493 0.05094 0.68264
lm(y ~ treatment + year:treatment, df) ## main effect `year` is missing
#Coefficients:
# (Intercept) treatment1 treatment0:year2 treatment1:year2
# 4.05471 0.05094 0.71493 1.34663
Here, contrasts for the interaction are applied to the variable whose main effect is missing. For example, in the first model the main effect treatment is missing, so the interaction drops treatment0; in the second model the main effect year is missing, so the interaction drops year1.
Always include all lower-order effects when specifying higher-order effects. This not only gives easy-to-understand contrast behaviour, but also has other appealing statistical justifications. You can read Including the interaction but not the main effects in a model on Cross Validated.
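For your original model, this means writing out both main effects alongside the interaction. A sketch of what to expect, using the question's design with a simulated response (the NA pattern depends only on the design matrix, not on y):

```r
set.seed(1)
id        <- factor(rep(1:5, each = 6))
year      <- factor(rep(1:2, times = 15))
treatment <- factor(rep(c(1, 0, 0, 1, 1), each = 6))
y <- rnorm(30)  # simulated response, for illustration only

# With all lower-order effects present, the interaction is contrasted
# consistently: year1 and year1:treatment1 are both dropped.
cf <- coef(lm(y ~ id + year + treatment + year:treatment))
names(cf)
# The treatment1 main effect is still NA (treatment is nested in id),
# but only year2:treatment1 appears for the interaction.
```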
Another suggestion is to always include the intercept. In linear regression, a model with an intercept yields unbiased estimates, and the residuals will have zero mean.
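A toy illustration of the residual-mean property, using simulated data:

```r
set.seed(1)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)

fit_with    <- lm(y ~ x)      # with intercept
fit_without <- lm(y ~ x - 1)  # intercept suppressed

mean(residuals(fit_with))     # essentially zero (up to rounding)
mean(residuals(fit_without))  # generally nonzero
```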