Question
I am trying to understand how R determines reference groups for interactions in a linear model. Consider the following:
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
y = c(1.4068142116718, 2.67041187927052, 2.69166439169131,
3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572,
5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286,
3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795,
5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801,
5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279,
5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312,
8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id",
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")
lm(y ~ -1 + id + year + year:treatment, df)
#Coefficients:
# id1 id2 id3 id4
# 2.6585 3.9933 4.1161 5.3544
# id5 year2 year1:treatment1 year2:treatment1
# 6.1991 0.7149 -0.6317 NA
R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NAs in the results.
Also, R is inconsistent about which groups it drops. I would like to estimate a model with the same omitted group (year1) in both the main effect and the interactions. How can I force R to omit year1 and year1:treatment1 from the preceding model?
I understand that there are several workarounds for this problem (e.g. creating all the variables by hand and writing them out in the model's formula). But the actual models I am estimating are much more complicated versions of this problem and such a workaround would be cumbersome.
Answer
There is no causality between those two. You get NAs purely because your variables are nested.
There is no inconsistency; model.matrix simply has its own rules. The contrasts only seem "inconsistent" because you do not have the main effect treatment, only the interaction treatment:year.
In the following, I will first explain the NA coefficients, then the contrasts for interactions, and finally give some suggestions.
By default, treatment contrasts are used for factors, and by default contr.treatment takes the first factor level as the reference level. Have a look at your data:
levels(df$id)
# [1] "1" "2" "3" "4" "5"
levels(df$year)
# [1] "1" "2"
levels(df$treatment)
# [1] "0" "1"
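You can inspect the contrast matrices themselves to see where the reference levels come from. A small illustration with contr.treatment (base R), for factors with two and five levels:

```r
# contr.treatment(n) maps n factor levels to n-1 dummy columns;
# the first level (the reference) gets an all-zero row.
contr.treatment(2)
#   2
# 1 0
# 2 1
contr.treatment(5)
#   2 3 4 5
# 1 0 0 0 0
# 2 1 0 0 0
# 3 0 1 0 0
# 4 0 0 1 0
# 5 0 0 0 1
```

The all-zero first row is why level "1" of id and year, and level "0" of treatment, never show up as coefficients.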
Now take a look at a simple linear model:
lm(y ~ id + year + treatment, df)
#Coefficients:
#(Intercept) id2 id3 id4 id5 year2
# 2.153 1.651 1.773 2.696 3.541 1.094
# treatment1
# NA
You can see that id1, year1 and treatment0 are not there, as they are taken as reference levels.
Even without the interaction, you already have an NA coefficient. This implies that treatment is nested in span{id, year}. When you further include treatment:year, this nesting still exists.
In fact, a further test shows that treatment is nested in id:
lm(y ~ id + treatment, df)
# Coefficients:
#(Intercept) id2 id3 id4 id5 treatment1
# 2.700 1.651 1.773 2.696 3.541 NA
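The nesting can also be confirmed without fitting anything. A quick sketch, reconstructing just the id and treatment columns from the data above: each id takes exactly one treatment value, so the treatment dummy is a linear combination of the id dummies.

```r
# Rebuild the two relevant factors from the question's data frame:
# ids 1, 4, 5 always have treatment "1"; ids 2, 3 always have "0".
id        <- factor(rep(1:5, each = 6))
treatment <- factor(rep(c(1, 0, 0, 1, 1), each = 6))

# Each row of this table has a single nonzero cell, i.e. each id
# is observed under exactly one treatment:
table(id, treatment)
#    treatment
# id  0 1
#   1 0 6
#   2 6 0
#   3 6 0
#   4 0 6
#   5 0 6
```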
In summary, the variable treatment is completely redundant for your modelling purpose. If you include id, there is no need to include treatment or treatment:*, where * can be any variable.
It is very easy to get nesting when you have many factor variables in a regression model. Note that contrasts will not necessarily remove nesting, as contrasts only recognise variable names, not the underlying numerical structure. Take the following example of how to cheat contr.treatment:
df$ID <- df$id
lm(y ~ id + ID, df)
#Coefficients:
#(Intercept) id2 id3 id4 id5 ID2
# 2.700 1.651 1.773 2.696 3.541 NA
# ID3 ID4 ID5
# NA NA NA
Look: contrasts work as expected, but ID is nested in id, so we have rank deficiency.
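You can see the rank deficiency directly on the design matrix, without fitting a model. A minimal sketch, reconstructing the id factor from the data above:

```r
# Duplicate factor, nested in (in fact identical to) id:
id <- factor(rep(1:5, each = 6))
ID <- id

# Contrasts produce 4 dummy columns per factor plus an intercept,
# but the ID columns are exact copies of the id columns:
X <- model.matrix(~ id + ID)
ncol(X)     # 9 columns (intercept + 4 + 4)
qr(X)$rank  # rank is only 5, so 4 coefficients must be NA
```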
We first remove the noise introduced by the NA by dropping the id variable. Then a regression model with treatment and year will be full rank, so no NA should be seen if the contrasts are successful.
Contrasts for an interaction or higher-order effect depend on the presence of the lower-order effects. Compare the following models:
lm(y ~ year:treatment, df) ## no low-order effects at all
#Coefficients:
# (Intercept) treatment0:year1 treatment1:year1 treatment0:year2
# 5.4523 -1.3976 -1.3466 -0.6826
#treatment1:year2
# NA
lm(y ~ year + treatment + year:treatment, df) ## with all low-order effects
#Coefficients:
# (Intercept) treatment1 year2 treatment1:year2
# 4.05471 0.05094 0.71493 0.63170
In the first model, no contrasts are applied, because there are no main effects, only the second-order effect. The NA here is due to the absence of contrasts.
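Again, this can be seen from the design matrix alone. A sketch reconstructing the two factors: without main effects, R codes all four year/treatment cells, which together with the intercept are rank-deficient by one.

```r
# Rebuild the factors from the question's data:
year      <- factor(rep(1:2, times = 15))
treatment <- factor(rep(c(1, 0, 0, 1, 1), each = 6))

# No main effects, only the interaction:
X <- model.matrix(~ year:treatment)
ncol(X)     # 5: intercept plus all four year x treatment cells
qr(X)$rank  # 4: the four cell dummies sum to the intercept column
```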
Now consider another set of examples, including some but not all main effects:
lm(y ~ year + year:treatment, df) ## main effect `treatment` is missing
#Coefficients:
# (Intercept) year2 year1:treatment1 year2:treatment1
# 4.05471 0.71493 0.05094 0.68264
lm(y ~ treatment + year:treatment, df) ## main effect `year` is missing
#Coefficients:
# (Intercept) treatment1 treatment0:year2 treatment1:year2
# 4.05471 0.05094 0.71493 1.34663
Here, contrasts for the interaction are applied to the variable whose main effect is missing. For example, in the first model the main effect treatment is missing, so the interaction drops treatment0; in the second model the main effect year is missing, so the interaction drops year1.
Always include all lower-order effects when specifying higher-order effects. This not only gives easy-to-understand contrast behaviour, but also has other appealing statistical justifications. You can read Including the interaction but not the main effects in a model on Cross Validated.
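For your original model, this means writing out both main effects alongside the interaction. A sketch of what to expect, using the question's design with a simulated response (the NA pattern depends only on the design matrix, not on y):

```r
set.seed(1)
id        <- factor(rep(1:5, each = 6))
year      <- factor(rep(1:2, times = 15))
treatment <- factor(rep(c(1, 0, 0, 1, 1), each = 6))
y <- rnorm(30)  # simulated response, for illustration only

# With all lower-order effects present, the interaction is contrasted
# consistently: year1 and year1:treatment1 are both dropped.
cf <- coef(lm(y ~ id + year + treatment + year:treatment))
names(cf)
# The treatment1 main effect is still NA (treatment is nested in id),
# but only year2:treatment1 appears for the interaction.
```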
Another suggestion is to always include the intercept. In linear regression, a model with an intercept yields unbiased estimates, and the residuals will have zero mean.
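A toy illustration of the residual-mean property, using simulated data:

```r
set.seed(1)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)

fit_with    <- lm(y ~ x)      # with intercept
fit_without <- lm(y ~ x - 1)  # intercept suppressed

mean(residuals(fit_with))     # essentially zero (up to rounding)
mean(residuals(fit_without))  # generally nonzero
```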