Problem Description
Many functions in scikit-learn implement user-friendly parallelization. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor it works very nicely. But what if I want to use such an option on a high-performance cluster (with the OpenMPI package installed and SLURM used for resource management)? As far as I know, sklearn uses joblib for parallelization, and joblib in turn uses multiprocessing. And, as far as I know (from, for example, Python multiprocessing within mpi), Python programs parallelized with multiprocessing are easy to scale over a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions over several computational nodes just by using mpirun and the n_jobs argument?
SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems like dask.distributed or IPython Parallel. See this issue on the sklearn
github page for details.
Example using Joblib with Dask.distributed
Code taken from the issue page linked above.
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV

# model, param_space and digits are assumed to be defined already
search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)
with parallel_backend('dask', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)
This requires that you set up a dask.distributed
scheduler and workers on your cluster. General instructions are available here: http://dask.readthedocs.io/en/latest/setup.html
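As a rough sketch (the address below is just a placeholder for your own deployment), once the scheduler and workers are up you can verify the deployment from Python by connecting a dask.distributed Client before entering the parallel_backend block:

from dask.distributed import Client

# placeholder address: use the host/port of the scheduler you started on the cluster
client = Client('your_scheduler_host:your_port')
print(client)  # the client's repr reports the workers and cores currently connected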
Example using Joblib with ipyparallel
Code taken from the same issue page.
from sklearn.datasets import load_digits
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend

digits = load_digits()

# connect to the running ipyparallel cluster and get a load-balanced view of its engines
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))
...
with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)
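This snippet assumes an IPython parallel cluster is already running under the profile 'myprofile' (typically started with something like ipcluster start --profile=myprofile on the cluster), so that the Client can find its engines.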
Note: in both the above examples, the n_jobs
parameter seems to not matter anymore.
Set up dask.distributed with SLURM
For SLURM, the easiest way to do this is probably to use the dask-jobqueue project:
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='...', queue='...', ...)
>>> cluster.scale(20)
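A minimal sketch of wiring the resulting cluster into the joblib examples above (assuming the cluster object from the snippet and a search estimator defined as before; the bare 'dask' backend here relies on a joblib/dask combination that picks up the active Client):

from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend

client = Client(cluster)        # connect to the SLURM-backed Dask cluster
with parallel_backend('dask'):  # route joblib work to the Dask workers
    search.fit(digits.data, digits.target)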
You could also use dask-mpi or any of the several other methods mentioned in Dask's setup documentation.
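If you would rather stay close to an mpirun-based workflow, a hedged sketch with dask-mpi might look like this (assuming the dask-mpi package is installed and the script is launched with something like mpirun -np 20 python your_script.py):

from dask_mpi import initialize
from dask.distributed import Client

# rank 0 becomes the Dask scheduler, rank 1 runs this client code,
# and the remaining MPI ranks become Dask workers
initialize()
client = Client()  # connects to the scheduler set up by initialize()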
Use dask.distributed directly
Alternatively, you can set up a dask.distributed or IPyParallel cluster and then use those interfaces directly to parallelize your SKLearn code. Here is an example video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561
Try Dask-ML
You could also try the Dask-ML package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but implemented computationally on top of Dask:
https://github.com/dask/dask-ml
pip install dask-ml
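A brief sketch of how the drop-in replacement might be used (assuming the same placeholder model, param_space and digits objects as in the earlier examples):

from dask_ml.model_selection import RandomizedSearchCV

# same constructor signature as the scikit-learn class, but the search
# itself is executed as a Dask task graph
search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000)
search.fit(digits.data, digits.target)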