r - Linear model with categorical variables in R

I am trying to fit a linear model with some categorical variables:

model <- lm(price ~ carat+cut+color+clarity)
summary(model)

The output is:

Call:
lm(formula = price ~ carat + cut + color + clarity)

Residuals:
     Min       1Q   Median       3Q      Max
-11495.7   -688.5   -204.1    458.2   9305.3

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -3696.818     47.948 -77.100  < 2e-16 ***
carat        8843.877     40.885 216.311  < 2e-16 ***
cut.L         755.474     68.378  11.049  < 2e-16 ***
cut.Q        -349.587     60.432  -5.785 7.74e-09 ***
cut.C         200.008     52.260   3.827 0.000131 ***
cut^4          12.748     42.642   0.299 0.764994
color.L      1905.109     61.050  31.206  < 2e-16 ***
color.Q      -675.265     56.056 -12.046  < 2e-16 ***
color.C       197.903     51.932   3.811 0.000140 ***
color^4        71.054     46.940   1.514 0.130165
color^5         2.867     44.586   0.064 0.948729
color^6        50.531     40.771   1.239 0.215268
clarity.L    4045.728    108.363  37.335  < 2e-16 ***
clarity.Q   -1545.178    102.668 -15.050  < 2e-16 ***
clarity.C     999.911     88.301  11.324  < 2e-16 ***
clarity^4    -665.130     66.212 -10.045  < 2e-16 ***
clarity^5     920.987     55.012  16.742  < 2e-16 ***
clarity^6    -712.168     52.346 -13.605  < 2e-16 ***
clarity^7    1008.604     45.842  22.002  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared:  0.9162,    Adjusted R-squared:  0.9159
F-statistic:  2817 on 18 and 4639 DF,  p-value: < 2.2e-16

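But I don't understand why the coefficients are labelled ".L, .Q, .C, ^4, ...". Something is wrong, but I can't tell what. I have already tried applying the function factor to each variable.

Best answer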

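You are running into the way regression functions handle ordered (ordinal) factor variables: the default contrasts for them are orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of the factor. That result is not going to be easy to interpret, especially if the levels have no natural order. Even if they do, as may well be the case here, you might not want the default ordering (alphabetical by factor level), and you might not want more than a few degrees in the polynomial contrasts.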
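To see where the ".L", ".Q", ".C" and "^4" labels come from, the contrast matrix of an ordered factor can be inspected directly. The sketch below uses base R only and a small hypothetical factor named quality that borrows the five cut labels; it assumes the session still has R's default contrasts option. The suffixes stand for the linear, quadratic, cubic and fourth-degree polynomial terms.

```r
# Hypothetical ordered factor using the same five labels as diamonds$cut.
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
quality <- factor(c("Good", "Ideal", "Fair", "Premium", "Very Good"),
                  levels = cut_levels, ordered = TRUE)

# Default contrasts: the first entry applies to unordered factors,
# the second to ordered ones.
default_contrasts <- options("contrasts")$contrasts
print(default_contrasts)

# n = 5 levels give n - 1 = 4 polynomial columns.
poly_contrasts <- contrasts(quality)
print(colnames(poly_contrasts))

# The columns are orthonormal, so their cross-product is the identity matrix.
print(round(crossprod(poly_contrasts), 10))
```

Each coefficient in the question's summary therefore belongs to one polynomial term across all levels of a factor, not to one individual level.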

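In the case of ggplot2's diamonds dataset the factor levels are set correctly, but most newcomers who stumble upon ordered factors end up with alphabetically sorted levels, where a label such as "Excellent" is placed by its spelling rather than by its meaning.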

> levels(diamonds$cut)
[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"
> levels(diamonds$clarity)
[1] "I1"   "SI2"  "SI1"  "VS2"  "VS1"  "VVS2" "VVS1" "IF"
> levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
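The difference between alphabetical and meaningful level order can be sketched with a small made-up vector. The ratings values and the labels "Poor", "Fair" and "Good" are hypothetical and do not come from the diamonds data; the point is only that passing levels explicitly controls the order.

```r
# Hypothetical ratings, not taken from the question's data.
ratings <- c("Good", "Poor", "Excellent", "Fair", "Good")

# Without explicit levels, R sorts the labels alphabetically.
alphabetical <- factor(ratings, ordered = TRUE)
print(levels(alphabetical))

# Passing the levels from worst to best gives the intended order.
meaningful <- factor(ratings,
                     levels = c("Poor", "Fair", "Good", "Excellent"),
                     ordered = TRUE)
print(levels(meaningful))

# Comparisons now follow the stated order ("Good" versus "Poor").
print(meaningful[1] > meaningful[2])
```

Polynomial contrasts and as.numeric codes both depend on this order, so it is worth checking levels() before fitting.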

Once ordered factors are set up properly, one way to use them is to wrap them in as.numeric, which gives a linear test for trend.

> contrasts(diamonds$cut) <- contr.treatment(5) # Treatment (dummy) contrasts instead of polynomial ones
> model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
> summary(model)

Call:
lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity),
    data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max
-19130.3   -696.1   -176.8    556.9   9599.8

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -5189.460     36.577 -141.88   <2e-16 ***
carat                8791.452     12.659  694.46   <2e-16 ***
cut2                  909.433     35.346   25.73   <2e-16 ***
cut3                 1129.518     32.772   34.47   <2e-16 ***
cut4                 1156.989     32.427   35.68   <2e-16 ***
cut5                 1264.128     32.160   39.31   <2e-16 ***
as.numeric(color)    -318.518      3.282  -97.05   <2e-16 ***
as.numeric(clarity)   522.198      3.521  148.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1227 on 53932 degrees of freedom
Multiple R-squared:  0.9054,    Adjusted R-squared:  0.9054
F-statistic: 7.371e+04 on 7 and 53932 DF,  p-value: < 2.2e-16
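The two codings used above can be compared on simulated data without loading ggplot2. This is a minimal sketch: price and color here are made up, with a built-in drop of 300 per color step, and converting to an unordered factor is shown as one possible alternative to assigning contr.treatment.

```r
set.seed(1)  # simulated data, purely illustrative

color_levels <- c("D", "E", "F", "G", "H", "I", "J")
color <- factor(sample(color_levels, 200, replace = TRUE),
                levels = color_levels, ordered = TRUE)

# as.numeric() on a factor returns the integer level codes 1..7.
color_code <- as.numeric(color)

# Hypothetical response: falls by 300 per step in color, plus noise.
price <- 5000 - 300 * color_code + rnorm(200, sd = 100)

# Linear trend: a single slope for the whole ordered factor.
trend_fit <- lm(price ~ as.numeric(color))
print(coef(trend_fit))

# Unordered copy: one dummy coefficient per level after the first.
dummy_fit <- lm(price ~ factor(color, ordered = FALSE))
print(names(coef(dummy_fit)))
```

The as.numeric form assumes equal spacing between adjacent levels, which is a modelling choice; the dummy coding makes no such assumption but uses one coefficient per non-reference level.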

Regarding r - Linear model with categorical variables in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30159162/