Why does caret::predict() use parallel processing only with xgbTree?

Problem description

I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly I noticed that predict with xgb uses parallel processing too.

I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to try to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were being used.

After some trial and error I found that caret::predict() appears to use parallel processing only where the model is xgbTree (possibly others too), but not for other models.

Surely predict could be done in parallel with any model, not just xgb?

Is it the default or expected behaviour of caret::predict() to use all available processors, and is there a way to control this, e.g. by switching it on or off?

Reproducible example:

library(tidyverse)
library(caret)
library(foreach)

# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))

iris_big <- do.call(rbind, replicate(1000, iris, simplify = FALSE))

nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000 rows
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)

# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop

  # get prediction
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))

  return(preds)
}

If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.

So predict seems to use parallel processing automatically depending on the type of model.

Is this correct? Is it controllable?

Answer

In this issue you can find the information you need:

https://github.com/dmlc/xgboost/issues/1345

In summary, if you trained your model with parallelism, the predict method will also run with parallel processing. If you want to change the latter behaviour, you must change a setting:

xgb.parameters(bst) <- list(nthread = 1)
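For a model trained through caret, the underlying xgb.Booster is stored in the finalModel slot, so the same setting can be applied there. A minimal sketch, assuming xgbFit is the model trained in the example above:

```r
library(xgboost)  # provides the xgb.parameters<- replacement function

# Restrict the booster stored by caret to a single thread;
# subsequent predict() calls should then run sequentially.
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)

preds <- predict(xgbFit, newdata = iris)
```

After this, re-running the foreach %do% loop from the question should use only one processor.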

An alternative is to change an environment variable:

OMP_NUM_THREADS
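The OpenMP runtime reads OMP_NUM_THREADS when it initialises, so the variable is most reliably set in the shell before R is launched. A sketch (the script name is a placeholder):

```shell
# Limit the OpenMP runtime (used by xgboost) to a single thread.
# Set this *before* starting R, since OpenMP reads it at startup.
export OMP_NUM_THREADS=1

# Then launch R from this shell, e.g.:
# Rscript my_predict_script.R   # (placeholder script name)
echo "$OMP_NUM_THREADS"
```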

And as you explained, this only happens for xgbTree.
