Problem Description
Many functions in scikit-learn implement user-friendly parallelization. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM used for resource management)? As far as I know, sklearn uses joblib for parallelization, and joblib in turn uses multiprocessing. And, as far as I know (from, for example, Python multiprocessing within mpi), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions over several computational nodes just by using mpirun and the n_jobs argument?
SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems like dask.distributed or IPython Parallel. See this issue on the sklearn
github page for details.
Example using Joblib with Dask.distributed
Code taken from the issue page linked above.
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV

# model, param_space and digits are assumed to be defined already
search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)
with parallel_backend('dask', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)
This requires that you set up a dask.distributed
scheduler and workers on your cluster. General instructions are available here: http://dask.readthedocs.io/en/latest/setup.html
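As a rough sketch (the address below is just a placeholder for your own deployment), once the scheduler and workers are up you can verify the deployment from Python by connecting a dask.distributed Client before entering the parallel_backend block:

from dask.distributed import Client

# placeholder address: use the host/port of the scheduler you started on the cluster
client = Client('your_scheduler_host:your_port')
print(client)  # the client's repr reports the workers and cores currently connected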
Example using Joblib with ipyparallel
Code taken from the same issue page.
from sklearn.datasets import load_digits
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend

digits = load_digits()

# connect to the running ipyparallel cluster and get a load-balanced view of its engines
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))
...
with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)
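This snippet assumes an IPython parallel cluster is already running under the profile 'myprofile' (typically started with something like ipcluster start --profile=myprofile on the cluster), so that the Client can find its engines.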
Note: in both the above examples, the n_jobs
parameter seems to not matter anymore.
Set up dask.distributed with SLURM
For SLURM, the easiest way to do this is probably to use the dask-jobqueue project:
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='...', queue='...', ...)
>>> cluster.scale(20)
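A minimal sketch of wiring the resulting cluster into the joblib examples above (assuming the cluster object from the snippet and a search estimator defined as before; the bare 'dask' backend here relies on a joblib/dask combination that picks up the active Client):

from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend

client = Client(cluster)        # connect to the SLURM-backed Dask cluster
with parallel_backend('dask'):  # route joblib work to the Dask workers
    search.fit(digits.data, digits.target)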
You could also use dask-mpi or any of the several other methods mentioned in Dask's setup documentation.
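If you would rather stay close to an mpirun-based workflow, a hedged sketch with dask-mpi might look like this (assuming the dask-mpi package is installed and the script is launched with something like mpirun -np 20 python your_script.py):

from dask_mpi import initialize
from dask.distributed import Client

# rank 0 becomes the Dask scheduler, rank 1 runs this client code,
# and the remaining MPI ranks become Dask workers
initialize()
client = Client()  # connects to the scheduler set up by initialize()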
Use dask.distributed directly
Alternatively, you can set up a dask.distributed or IPyParallel cluster and then use those interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
Try Dask-ML
You could also try the Dask-ML package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but implemented computationally on top of Dask:
https://github.com/dask/dask-ml
pip install dask-ml
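A brief sketch of how the drop-in replacement might be used (assuming the same placeholder model, param_space and digits objects as in the earlier examples):

from dask_ml.model_selection import RandomizedSearchCV

# same constructor signature as the scikit-learn class, but the search
# itself is executed as a Dask task graph
search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000)
search.fit(digits.data, digits.target)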