Problem description
I have created a script like the one below to do something I call "weighted" regression:
library(plyr)
set.seed(100)
temp.df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)
temp.df.expand <- ddply(temp.df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})
temp.df.lm <- lm(bp~age,data=temp.df,weights=weight)
temp.df.expand.lm <- lm(bp~age,data=temp.df.expand)
You can see that each row in temp.df has a weight. What I mean is that there are 1178 samples in total, but rows with the same bp and age are merged into one row, and the number of merged samples is recorded in the weight column.
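For readers who want to reproduce the expanded data frame without plyr, here is a minimal sketch in base R. It assumes the same temp.df as in the script above; the name idx is introduced only for illustration. Note that the default algorithm behind sample() changed in R 3.6.0, so the total weight on a current R version may not be 1178.

```r
set.seed(100)
temp.df <- data.frame(uid = 1:200,
                      bp = sample(x = 100:200, size = 200, replace = TRUE),
                      age = sample(x = 30:65, size = 200, replace = TRUE),
                      weight = sample(1:10, size = 200, replace = TRUE),
                      stringsAsFactors = FALSE)

# Repeat each row index `weight` times (idx is a hypothetical helper name),
# then use the repeated indices to select rows.
idx <- rep(seq_len(nrow(temp.df)), times = temp.df$weight)
temp.df.expand <- temp.df[idx, c("uid", "bp", "age")]
rownames(temp.df.expand) <- NULL

# The expanded frame has one row per underlying sample.
nrow(temp.df.expand) == sum(temp.df$weight)
```

The row-index approach gives the same rows as the ddply() version, because both repeat every uid exactly weight times.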
I used the weights argument of the lm function, then cross-checked the result against another data frame, temp.df.expand, in which temp.df is "expanded" to one row per sample. But I found that the lm output differs between the two data frames.
Did I misinterpret the weights argument of the lm function? Can anyone let me know how to run the regression properly (i.e. without expanding the data frame manually) for a dataset presented like temp.df? Thanks.
Recommended answer
The problem here is that the degrees of freedom are not being added up properly to give the right Df and mean-square statistics. lm bases the residual Df on the number of rows in temp.df rather than on the total weight, so it undercounts the samples. This will correct the problem:
temp.df.lm.aov <- anova(temp.df.lm)
# Residual Df should come from the total weight (number of samples),
# not from the number of rows in temp.df
temp.df.lm.aov$Df[length(temp.df.lm.aov$Df)] <-
    sum(temp.df.lm$weights) -
    sum(temp.df.lm.aov$Df[-length(temp.df.lm.aov$Df)]) - 1
# Recompute the statistics that depend on Df
temp.df.lm.aov$`Mean Sq` <- temp.df.lm.aov$`Sum Sq` / temp.df.lm.aov$Df
temp.df.lm.aov$`F value`[1] <- temp.df.lm.aov$`Mean Sq`[1] /
    temp.df.lm.aov$`Mean Sq`[2]
temp.df.lm.aov$`Pr(>F)`[1] <- pf(temp.df.lm.aov$`F value`[1],
                                 temp.df.lm.aov$Df[1], temp.df.lm.aov$Df[2],
                                 lower.tail = FALSE)
temp.df.lm.aov
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4
Compare:
> anova(temp.df.expand.lm)
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
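The same degrees-of-freedom argument can be checked on the coefficient table. The sketch below is not part of the original answer: it assumes integer weights that act as frequency counts, rebuilds the data inline so it runs on its own, and uses illustrative names (fit.w, fit.e, df.w, df.e, se.fixed). The coefficient estimates of the two fits should already agree. Only the standard errors differ, and they differ by the factor sqrt(df.w / df.e), because both fits share the same residual sum of squares but divide it by different residual Df.

```r
set.seed(100)
temp.df <- data.frame(bp = sample(100:200, 200, replace = TRUE),
                      age = sample(30:65, 200, replace = TRUE),
                      weight = sample(1:10, 200, replace = TRUE))
temp.df.expand <- temp.df[rep(seq_len(nrow(temp.df)), temp.df$weight), ]

fit.w <- lm(bp ~ age, data = temp.df, weights = weight)  # weighted fit
fit.e <- lm(bp ~ age, data = temp.df.expand)             # expanded fit

# Point estimates agree without any correction.
all.equal(coef(fit.w), coef(fit.e))

# Residual Df as lm reports it (rows minus coefficients) versus the
# Df implied by treating the weights as counts (samples minus coefficients).
df.w <- df.residual(fit.w)
df.e <- sum(temp.df$weight) - length(coef(fit.w))

# Rescale the weighted-fit standard errors to the frequency-count Df.
se.w <- coef(summary(fit.w))[, "Std. Error"]
se.fixed <- se.w * sqrt(df.w / df.e)

# The rescaled standard errors should match the expanded fit.
all.equal(se.fixed, coef(summary(fit.e))[, "Std. Error"])
```

If the weights are not counts (for example inverse-variance weights), this rescaling is not appropriate and the unmodified lm output is the one to use.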
I am a bit surprised this has not come up more often on R-help. Either that, or my search-strategy development powers are weakening with old age.