dplyr version of grouping a data frame, then creating a regression model on each group

Problem description

Can anyone suggest a dplyr answer to the following question? Split data.frame by country, and create linear regression model on each subset

For completeness, the question and answer from the link are included below.

For reference, here's Josh's question:

I have a data.frame of data from the World Bank which looks something like this:

  country date BirthRate     US.
4   Aruba 2011    10.584 25354.8
5   Aruba 2010    10.804 24289.1
6   Aruba 2009    11.060 24639.9
7   Aruba 2008    11.346 27549.3
8   Aruba 2007    11.653 25921.3
9   Aruba 2006    11.977 24015.4

All in all there are 70-something subsets of countries in this data frame that I would like to run a linear regression on. If I use the following, I get a nice lm for a single country:

andora = subset(high.sub, country == "Andorra")

andora.lm = lm(BirthRate~US., data = andora)

anova(andora.lm)
summary(andora.lm)

But when I try to use the same type of code in a for loop, I get an error, which I'll print below the code:

high.sub = subset(highInc, date > 1999 & date < 2012)
high.sub <- na.omit(high.sub)
highnames <- unique(high.sub$country)

for (i in highnames) {
  linmod <- lm(BirthRate~US., data = high.sub, subset = (country == "[i]"))
}

#Error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  0 (non-NA) cases

If I can get this loop to run I would ideally like to append the coefficients and even better the r-squared values for each model to an empty data.frame. Any help would be greatly appreciated.
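
A likely cause of the error is that `country == "[i]"` compares every row against the literal string `"[i]"` rather than the loop variable, so no rows are selected and `lm()` has nothing to fit. The sketch below shows one way the loop might be repaired. The data frame `df` and its values are made up for illustration; only the column names come from the question.

```r
# Hypothetical toy data in the same shape as the World Bank data frame above;
# the values are invented for illustration only.
df <- data.frame(
  country   = rep(c("Aruba", "Andorra"), each = 5),
  date      = rep(2007:2011, times = 2),
  BirthRate = c(11.7, 11.3, 11.1, 10.8, 10.6, 10.9, 10.4, 10.2, 9.9, 9.5),
  US.       = c(25921, 27549, 24640, 24289, 25355, 39000, 41000, 40500, 39800, 41500)
)

# The literal string "[i]" matches no country, so the subset is empty.
n_matching <- sum(df$country == "[i]")

# Compare against the loop variable itself and keep every fit in a named list,
# instead of overwriting a single object on each iteration.
models <- list()
for (i in unique(as.character(df$country))) {
  models[[i]] <- lm(BirthRate ~ US., data = df, subset = (country == i))
}

names(models)
```

Storing each fit under its country name also avoids the second problem in the original loop, where `linmod` would hold only the last model once the loop finished.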

For reference, here's jlhoward's answer (incorporating BondedDust's comment), making use of the *apply functions found in this excellent question: R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate

models <- sapply(unique(as.character(df$country)),
                 function(cntry)lm(BirthRate~US.,df,subset=(country==cntry)),
                 simplify=FALSE,USE.NAMES=TRUE)

# to summarize all the models
lapply(models,summary)
# to run anova on all the models
lapply(models,anova)

#This produces a named list of models, so you could extract the model for Aruba as:
models[["Aruba"]]
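
The question also asks for the coefficients and r-squared values in a data frame. A minimal sketch of how that could be built from the named list follows. The toy `df`, the `results` object and its column names are assumptions made for illustration and are not part of jlhoward's answer.

```r
# Hypothetical toy data; values are invented for illustration only.
df <- data.frame(
  country   = rep(c("Aruba", "Andorra"), each = 5),
  date      = rep(2007:2011, times = 2),
  BirthRate = c(11.7, 11.3, 11.1, 10.8, 10.6, 10.9, 10.4, 10.2, 9.9, 9.5),
  US.       = c(25921, 27549, 24640, 24289, 25355, 39000, 41000, 40500, 39800, 41500)
)

# Named list of models, built the same way as in the answer above.
models <- sapply(unique(as.character(df$country)),
                 function(cntry) lm(BirthRate ~ US., df, subset = (country == cntry)),
                 simplify = FALSE, USE.NAMES = TRUE)

# One single-row data frame per model, then stack them into one table.
results <- do.call(rbind, lapply(names(models), function(cntry) {
  fit <- models[[cntry]]
  data.frame(country   = cntry,
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]),
             r.squared = summary(fit)$r.squared)
}))

results
```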


Recommended answer

Returning a list from dplyr is not possible yet. If you just need the intercept and slope, @jazzurro's answer is the way to go, but if you need the whole model you need to do something like:

library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))
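
To make the shape of that result concrete, here is a small sketch that assumes dplyr is installed and uses a made-up `df`. It shows that the result has one row per country and that the `mod` column is a list holding the fitted `lm` objects. The name `aruba_fit` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only.
df <- data.frame(
  country   = rep(c("Aruba", "Andorra"), each = 5),
  date      = rep(2007:2011, times = 2),
  BirthRate = c(11.7, 11.3, 11.1, 10.8, 10.6, 10.9, 10.4, 10.2, 9.9, 9.5),
  US.       = c(25921, 27549, 24640, 24289, 25355, 39000, 41000, 40500, 39800, 41500)
)

models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))

# One row per country, with the grouping column kept alongside the models.
models$country

# Each element of the list-column is an ordinary lm object.
aruba_fit <- models$mod[[which(models$country == "Aruba")]]
coef(aruba_fit)
```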

Then if you want to perform ANOVA on each fitted model, you can do it using rowwise:

models %>% rowwise %>% do(anova(.$mod))

But again, the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova).
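
For comparison, the sketch below shows what the list form keeps. It uses base R only, with a made-up `df` and a plain named list standing in for `models$mod`. Each element stays a full `anova` object, including the term labels as row names.

```r
# Hypothetical toy data; values are invented for illustration only.
df <- data.frame(
  country   = rep(c("Aruba", "Andorra"), each = 5),
  date      = rep(2007:2011, times = 2),
  BirthRate = c(11.7, 11.3, 11.1, 10.8, 10.6, 10.9, 10.4, 10.2, 9.9, 9.5),
  US.       = c(25921, 27549, 24640, 24289, 25355, 39000, 41000, 40500, 39800, 41500)
)

# A plain named list of fitted models, standing in for models$mod.
mods <- sapply(unique(as.character(df$country)),
               function(cntry) lm(BirthRate ~ US., df, subset = (country == cntry)),
               simplify = FALSE, USE.NAMES = TRUE)

anovas <- lapply(mods, anova)

# Each element is still an "anova" table with its own row names and heading.
class(anovas[["Aruba"]])
rownames(anovas[["Aruba"]])
attr(anovas[["Aruba"]], "heading")
```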

For now (i.e. until the next version of dplyr), if you need to store the whole result in a list, you can just use dlply from plyr, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d))). Of course, if you do not absolutely have to use dplyr, you can go for @SvenHohenstein's answer, which looks like a better way of doing this anyway.
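
If plyr is not available, a similar named list can be sketched in base R with `split()` and `lapply()`. This is an illustrative alternative under the same toy-data assumption (`df` and its values are made up), not the approach taken in any of the answers mentioned above.

```r
# Hypothetical toy data; values are invented for illustration only.
df <- data.frame(
  country   = rep(c("Aruba", "Andorra"), each = 5),
  date      = rep(2007:2011, times = 2),
  BirthRate = c(11.7, 11.3, 11.1, 10.8, 10.6, 10.9, 10.4, 10.2, 9.9, 9.5),
  US.       = c(25921, 27549, 24640, 24289, 25355, 39000, 41000, 40500, 39800, 41500)
)

# split() yields one data frame per country, named by country.
by_country <- split(df, df$country)

# Fit and test each piece, mirroring the function passed to dlply above.
anova_list <- lapply(by_country, function(d) anova(lm(BirthRate ~ US., data = d)))

names(anova_list)
anova_list[["Andorra"]]
```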

08-06 07:00