Question
When I run 2 random forests in caret, I get the exact same results if I set a random seed:
library(caret)
library(doParallel)
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
However, if I register a parallel back-end to speed up the modeling, I get a different result each time I run the model:
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)
> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01813729"
[2] "Component 3: Mean relative difference: 0.02271638"
Is there any way to fix this issue? One suggestion was to use the doRNG package, but train uses nested loops, which currently aren't supported:
library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
registerDoRNG()
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
> model1 <- train(Species~., iris, method='rf', trControl=myControl)
Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter", :
nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.
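For a flat (non-nested) foreach loop, doRNG does work; a minimal sketch (not from the original post) of the pattern that train cannot use because of its nested loops:
library(doRNG)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
set.seed(42)
r1 <- foreach(i = 1:5) %dorng% runif(1)         # %dorng% gives each iteration its own reproducible stream
set.seed(42)
r2 <- foreach(i = 1:5) %dorng% runif(1)
identical(r1, r2)                               # TRUE for a flat loop
stopCluster(cl)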
UPDATE: I thought this problem could be solved using doSNOW and clusterSetupRNG, but I couldn't quite get there.
set.seed(42)
library(caret)
library(doSNOW)
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
clusterSetupRNG(cl, seed=rep(12345,6))
a <- clusterCall(cl, runif, 10000)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
clusterSetupRNG(cl, seed=rep(12345,6))
b <- clusterCall(cl, runif, 10000)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
all.equal(a, b)
[1] TRUE
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01890339"
[2] "Component 3: Mean relative difference: 0.01656751"
stopCluster(cl)
What's special about foreach, and why doesn't it use the seeds I initiated on the cluster? Objects a and b are identical, so why aren't model1 and model2?
Answer
One easy way to run fully reproducible models in parallel with the caret package is to use the seeds argument of trainControl. This resolves the question above; see the trainControl help page for further details.
library(doParallel); library(caret)
#create a list of seeds, with a different seed vector for each resampling iteration
set.seed(123)
#list length is (n_repeats * n_resampling) + 1; here 10 CV folds + 1 = 11
seeds <- vector(mode = "list", length = 11)
#each vector holds 3 integers, one per tuning parameter value tried (mtry for rf, here ncol(iris) - 2 = 3)
for(i in 1:10) seeds[[i]] <- sample.int(n = 1000, 3)
#a single seed for the final model
seeds[[11]] <- sample.int(1000, 1)
#control list
myControl <- trainControl(method='cv', seeds=seeds, index=createFolds(iris$Species))
#run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)
#compare
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
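The same idea generalizes to other resampling schemes. A hedged sketch (the grid and variable names below are illustrative, not from the original answer) for repeated CV with an explicit tuning grid, following the trainControl documentation: the list needs B + 1 elements, where B is the number of resamples, each of the first B elements holds one integer per candidate model, and the last element is a single integer for the final fit.
#sketch: seeds for 10-fold CV repeated 3 times, tuning mtry over an explicit grid
library(caret)
rfGrid <- expand.grid(mtry = c(2, 3, 4))        # 3 candidate models per resample
n_folds <- 10
n_repeats <- 3
B <- n_folds * n_repeats                        # number of resamples
set.seed(123)
seeds <- vector(mode = "list", length = B + 1)
for(i in 1:B) seeds[[i]] <- sample.int(1000, nrow(rfGrid))
seeds[[B + 1]] <- sample.int(1000, 1)           # single seed for the final model
myControl <- trainControl(method = 'repeatedcv', number = n_folds,
                          repeats = n_repeats, seeds = seeds)
#then: train(Species~., iris, method = 'rf', tuneGrid = rfGrid, trControl = myControl)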