Problem description
I'm trying to run R in parallel to run a regression. I'm trying to use the snowfall library (but am open to any approach). Currently, I'm running the following regression which is taking an extremely long time to run. Can someone show me how to do this?
sales_day_region_ctgry_lm <- lm(log(sales_out+1)~factor(region_out)
+ date_vector_out + factor(date_vector_out) +
factor(category_out) + mean_temp_out)
I've started down the following path:
library(snowfall)
sfInit(parallel = TRUE, cpus=4, type="SOCK")
wrapper <- function() {
  return(lm(log(sales_out+1)~factor(region_out) + date_vector_out +
            factor(date_vector_out) + factor(category_out) + mean_temp_out))
}
output_lm <- sfLapply(*no idea what to do here*,wrapper)
sfStop()
summary(output_lm)
But this approach is riddled with errors.
Thanks!
Recommended answer
The partools package offers an easy, off-the-shelf implementation of parallelised linear regression via its calm() function. (The "ca" prefix stands for "chunk averaging".)
In your case -- leaving aside @Roland's correct comment about mixing up factor and continuous predictors -- the solution should be as simple as:
library(partools)
#library(parallel) ## loads as dependency
cls <- makeCluster(4) ## Or, however many cores you want/have.
sales_day_region_ctgry_calm <-
  calm(
    cls,
    "log(sales_out+1) ~ factor(region_out) + date_vector_out +
     factor(date_vector_out) + factor(category_out) + mean_temp_out,
     data=YOUR_DATA_HERE"
  )
Note that the model call is passed as a quoted string. Note further that if your data are ordered in any way (e.g. by date), you may need to randomise the row order first, so that each chunk resembles a random sample of the whole data set. See the partools vignette for more details.
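To make the "chunk averaging" idea concrete, here is a minimal sketch using only the base parallel package. It is not how partools implements calm() internally, just an illustration of the general technique: shuffle the rows, split them into chunks, fit lm() on each chunk on a separate worker, and average the coefficient vectors. The simulated data frame dat, its columns (region, temp, sales) and the helper fit_chunk are all made up for the example.

```r
library(parallel)

set.seed(42)

# Simulated stand-in data (hypothetical; replace with your own data frame)
n <- 20000
dat <- data.frame(
  region = factor(sample(c("north", "south", "east"), n, replace = TRUE)),
  temp   = rnorm(n, mean = 15, sd = 5)
)
dat$sales <- exp(1 + 0.3 * (dat$region == "south") + 0.02 * dat$temp +
                 rnorm(n, sd = 0.1)) - 1

# Shuffle the rows so that each chunk is roughly a random sample
dat <- dat[sample(nrow(dat)), ]

# Split the rows into equally sized chunks, one per worker
n_chunks <- 4
chunk_id <- rep(seq_len(n_chunks), length.out = nrow(dat))
chunks   <- split(dat, chunk_id)

# Fit the model on one chunk and return only its coefficients
fit_chunk <- function(d) {
  coef(lm(log(sales + 1) ~ region + temp, data = d))
}

# Fit all chunks in parallel, then shut the workers down
cls <- makeCluster(n_chunks)
chunk_coefs <- parLapply(cls, chunks, fit_chunk)
stopCluster(cls)

# Chunk-averaged estimate: element-wise mean of the coefficient vectors
ca_coef <- Reduce(`+`, chunk_coefs) / n_chunks

# Compare against the ordinary full-data fit
full_coef <- coef(lm(log(sales + 1) ~ region + temp, data = dat))
print(round(cbind(chunk_averaged = ca_coef, full_data = full_coef), 4))
```

With shuffled data the averaged coefficients should land very close to the full-data fit. One caveat of this naive version: a factor level that is missing from a chunk would change the length of that chunk's coefficient vector, which is another reason to randomise the rows before splitting.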