I am trying to fit a linear model with some categorical variables:
model <- lm(price ~ carat+cut+color+clarity)
summary(model)
The output is:
Call:
lm(formula = price ~ carat + cut + color + clarity)
Residuals:
Min 1Q Median 3Q Max
-11495.7 -688.5 -204.1 458.2 9305.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3696.818 47.948 -77.100 < 2e-16 ***
carat 8843.877 40.885 216.311 < 2e-16 ***
cut.L 755.474 68.378 11.049 < 2e-16 ***
cut.Q -349.587 60.432 -5.785 7.74e-09 ***
cut.C 200.008 52.260 3.827 0.000131 ***
cut^4 12.748 42.642 0.299 0.764994
color.L 1905.109 61.050 31.206 < 2e-16 ***
color.Q -675.265 56.056 -12.046 < 2e-16 ***
color.C 197.903 51.932 3.811 0.000140 ***
color^4 71.054 46.940 1.514 0.130165
color^5 2.867 44.586 0.064 0.948729
color^6 50.531 40.771 1.239 0.215268
clarity.L 4045.728 108.363 37.335 < 2e-16 ***
clarity.Q -1545.178 102.668 -15.050 < 2e-16 ***
clarity.C 999.911 88.301 11.324 < 2e-16 ***
clarity^4 -665.130 66.212 -10.045 < 2e-16 ***
clarity^5 920.987 55.012 16.742 < 2e-16 ***
clarity^6 -712.168 52.346 -13.605 < 2e-16 ***
clarity^7 1008.604 45.842 22.002 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared: 0.9162, Adjusted R-squared: 0.9159
F-statistic: 2817 on 18 and 4639 DF, p-value: < 2.2e-16
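But I don't understand why the coefficients are labelled ".L, .Q, .C, ^4, ...". Something is wrong, but I don't know what. I have already tried applying the function factor to each variable.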
Best answer
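You are running into the way regression functions handle "ordered" (ordinal) factor variables: the default set of contrasts is orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of that factor. The suffixes .L, .Q, .C and ^4 are the linear, quadratic, cubic and fourth-degree terms. That result is not going to be very easy to interpret, especially if there is no natural order. Even if there is one, and in this case there may well be, you might not want the default ordering (which is alphabetical by factor level), and you might not want more than a few degrees in the polynomial contrasts.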
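To make the coefficient names less mysterious, here is a minimal sketch using only base R. It prints the polynomial contrast matrix for a five-level ordered factor (the same number of levels as cut) and then keeps only the linear and quadratic terms through the how.many argument of the contrasts replacement function. The factor rating and its levels are hypothetical and exist only for illustration.

```r
# Polynomial contrasts for a 5-level ordered factor (cut also has 5 levels).
cp <- contr.poly(5)
print(colnames(cp))   # the suffixes that lm() appends to the variable name
print(round(cp, 3))

# The columns are orthonormal, so their cross-product is the identity matrix.
ortho_ok <- isTRUE(all.equal(crossprod(cp), diag(4), check.attributes = FALSE))

# Hypothetical ordered factor, used only to show how to cap the degree.
rating <- factor(letters[1:5], levels = letters[1:5], ordered = TRUE)

# Keep only the first two contrast columns (linear and quadratic).
contrasts(rating, how.many = 2) <- contr.poly(5)
kept <- colnames(contrasts(rating))
print(kept)
```

A factor prepared this way would contribute only a .L and a .Q coefficient to a model, rather than one term per degree up to n-1.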
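In the case of ggplot2's diamonds dataset the factor levels are set up correctly, but most newcomers who stumble across ordered factors end up with alphabetically sorted levels that begin with something like "Excellent".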
> levels(diamonds$cut)
[1] "Fair" "Good" "Very Good" "Premium" "Ideal"
> levels(diamonds$clarity)
[1] "I1" "SI2" "SI1" "VS2" "VS1" "VVS2" "VVS1" "IF"
> levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
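The alphabetical-ordering pitfall can be sketched as follows. The vector quality and its four labels are made up for illustration and are not part of the diamonds data. Passing levels explicitly to factor is what fixes the order.

```r
# Hypothetical ratings; these values are illustrative, not from diamonds.
quality <- c("Good", "Poor", "Excellent", "Fair", "Good")

# Without explicit levels, the levels are sorted alphabetically.
alphabetical <- factor(quality, ordered = TRUE)
print(levels(alphabetical))

# With explicit levels, the order reflects the intended ranking.
ranked <- factor(quality,
                 levels = c("Poor", "Fair", "Good", "Excellent"),
                 ordered = TRUE)
print(levels(ranked))

# Comparisons on an ordered factor follow the level order.
poor_below_good <- ranked[2] < ranked[1]   # "Poor" < "Good"
print(poor_below_good)
```

In the alphabetical version "Excellent" would be treated as the lowest level, which would make any trend contrast meaningless.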
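Once the ordered factors are set up correctly, one way to use them is to wrap them in as.numeric, which gives you a linear test of trend.
> contrasts(diamonds$cut) <- contr.treatment(5) # Switch cut from polynomial to treatment (dummy) contrasts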
> model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
> summary(model)
Call:
lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity),
data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-19130.3 -696.1 -176.8 556.9 9599.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5189.460 36.577 -141.88 <2e-16 ***
carat 8791.452 12.659 694.46 <2e-16 ***
cut2 909.433 35.346 25.73 <2e-16 ***
cut3 1129.518 32.772 34.47 <2e-16 ***
cut4 1156.989 32.427 35.68 <2e-16 ***
cut5 1264.128 32.160 39.31 <2e-16 ***
as.numeric(color) -318.518 3.282 -97.05 <2e-16 ***
as.numeric(clarity) 522.198 3.521 148.31 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1227 on 53932 degrees of freedom
Multiple R-squared: 0.9054, Adjusted R-squared: 0.9054
F-statistic: 7.371e+04 on 7 and 53932 DF, p-value: < 2.2e-16
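The three ways of coding an ordered factor discussed here can be compared on a small simulated example that does not need ggplot2. This is only a sketch: the variable grade, its levels and the simulated response y are assumptions made up for illustration.

```r
set.seed(1)

# Hypothetical three-level ordered factor and a response that rises with it.
grade <- factor(sample(c("low", "mid", "high"), 200, replace = TRUE),
                levels = c("low", "mid", "high"), ordered = TRUE)
y <- 2 * as.numeric(grade) + rnorm(200)

# 1. Ordered factor: polynomial contrasts, coefficients grade.L and grade.Q.
fit_poly <- lm(y ~ grade)

# 2. Same factor without the ordering: treatment (dummy) contrasts,
#    one coefficient per non-reference level.
grade_unordered <- factor(grade, ordered = FALSE)
fit_dummy <- lm(y ~ grade_unordered)

# 3. Integer codes: a single slope, i.e. a linear test of trend.
fit_trend <- lm(y ~ as.numeric(grade))

print(names(coef(fit_poly)))
print(names(coef(fit_dummy)))
print(names(coef(fit_trend)))

# Fits 1 and 2 are the same model in two parametrizations.
same_fit <- isTRUE(all.equal(fitted(fit_poly), fitted(fit_dummy)))
print(same_fit)
```

The polynomial and dummy codings give identical fitted values and differ only in how the coefficients are expressed, while the as.numeric version is a more restrictive model that assumes equal steps between adjacent levels.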
This question, "r - Linear model with categorical variables in R", comes from Stack Overflow: https://stackoverflow.com/questions/30159162/