Question
My training dataset has about 200,000 records and 500 features. (These are sales data from a retail org.) Most of the features are 0/1 and are stored as a sparse matrix.
The goal is to predict the probability to buy for about 200 products. So, I would need to use the same 500 features to predict the probability of purchase for each of the 200 products. Since glmnet is a natural choice for model creation, I thought about fitting the 200 glmnet models in parallel (since all 200 models are independent), but I am stuck with foreach. The code I executed was:
foreach(i = 1:ncol(target)) %dopar% {
  assign(model[i], cv.glmnet(x, target[,i], family="binomial", alpha=0,
                             type.measure="auc", grouped=FALSE,
                             standardize=FALSE, parallel=TRUE))
}
model is a list of 200 model names under which I want to store the respective fitted models.
The following code works, but it doesn't exploit the parallel structure and takes about a day to finish!
for (i in 1:ncol(target)) {
  assign(model[i], cv.glmnet(x, target[,i], family="binomial", alpha=0,
                             type.measure="auc", grouped=FALSE,
                             standardize=FALSE, parallel=TRUE))
}
Can someone point to me on how to exploit the parallel structure in this case?
Accepted Answer
In order to execute "cv.glmnet" in parallel, you have to specify the parallel=TRUE
option, and register a foreach parallel backend. This allows you to choose the parallel backend that works best for your computing environment.
Here's the documentation for the "parallel" argument from the cv.glmnet man page:
parallel: If 'TRUE', use parallel 'foreach' to fit each fold. Must register parallel beforehand, such as 'doMC' or others. See the example below.
Here's an example using the doParallel package which works on Windows, Mac OS X, and Linux:
library(glmnet)
library(doParallel)
registerDoParallel(4)
m <- cv.glmnet(x, target[,1], family="binomial", alpha=0, type.measure="auc",
               grouped=FALSE, standardize=FALSE, parallel=TRUE)
This call to cv.glmnet will execute in parallel using four workers. On Linux and Mac OS X, it will execute the tasks using "mclapply", while on Windows it will use "clusterApplyLB".
Nested parallelism gets tricky, and may not help a lot with only 4 workers. I would try using a normal for loop around cv.glmnet (as in your second example) with a parallel backend registered and see what the performance is before adding another level of parallelism.
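A minimal sketch of that suggestion, assuming x and target are defined as in the question and that the glmnet and doParallel packages are installed: the outer loop stays serial, each cv.glmnet call parallelizes its own cross-validation folds, and the fits are collected in a list instead of via assign.

```r
library(glmnet)
library(doParallel)

registerDoParallel(4)  # register the backend once, up front

# Serial outer loop over products; each cv.glmnet call
# runs its CV folds in parallel on the registered backend
models <- vector("list", ncol(target))
for (i in seq_len(ncol(target))) {
  models[[i]] <- cv.glmnet(x, target[, i], family = "binomial", alpha = 0,
                           type.measure = "auc", grouped = FALSE,
                           standardize = FALSE, parallel = TRUE)
}
names(models) <- model  # 'model' holds the 200 model names from the question
```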
Also note that the assignment to "model" in your first example isn't going to work when you register a parallel backend. When running in parallel, side-effects generally get thrown away, as with most parallel programming packages.
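If you do want the outer loop itself to run in parallel, the idiomatic foreach pattern is to return each fitted model from the loop body and let foreach collect the results into a list, rather than relying on assign. A sketch under the same assumptions about x, target, and model; parallel = FALSE inside the loop avoids nested parallelism:

```r
library(glmnet)
library(doParallel)

registerDoParallel(4)

# foreach collects the returned values into a list;
# no side effects (assign) are needed in the workers
models <- foreach(i = seq_len(ncol(target)), .packages = "glmnet") %dopar% {
  cv.glmnet(x, target[, i], family = "binomial", alpha = 0,
            type.measure = "auc", grouped = FALSE,
            standardize = FALSE, parallel = FALSE)
}
names(models) <- model
```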