Problem description
We have just started using Slurm to manage our GPUs (currently just 2). We use Ubuntu 14.04 and slurm-llnl. I have configured gres.conf and srun works. The problem is that if I run two jobs with --gres=gpu:1, the two GPUs are successfully allocated and the jobs start running. I then expect to be able to run more jobs (in addition to the 2 GPU jobs) without --gres=gpu:1 (i.e. jobs that only use CPU and RAM), but it is not possible.
The error message says that it could not allocate the required resources (even though there are 24 CPU cores).
This is my gres.conf:
Name=gpu Type=titanx File=/dev/nvidia0
Name=gpu Type=titanx File=/dev/nvidia1
NodeName=ubuntu Name=gpu Type=titanx File=/dev/nvidia[0-1]
Thank you for your help.
Recommended answer
Make sure that SelectType in your configuration is select/cons_res with SelectTypeParameters set to CR_CPU or CR_Core, and that the Shared option of the partition is not set to EXCLUSIVE. Otherwise Slurm allocates full nodes to jobs.
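As a rough sketch of the settings the answer refers to, a slurm.conf fragment might look like the one below. In slurm.conf the consumable-resource plugin is chosen with SelectType=select/cons_res, and the CR_CPU or CR_Core value goes in SelectTypeParameters. The node name ubuntu, the 24 CPUs, and the titanx GPU type are taken from the question. The partition name debug and the remaining values are assumptions for illustration only, so adapt them to the actual cluster.

```conf
# slurm.conf (fragment): an illustrative sketch, not a complete configuration

# Allocate individual cores instead of whole nodes (the default plugin,
# select/linear, hands out full nodes).
SelectType=select/cons_res
# CR_CPU is the alternative mentioned in the answer.
SelectTypeParameters=CR_Core

# Enable generic resources so that --gres=gpu:1 is understood.
GresTypes=gpu

# Node definition: name, CPU count and GPU type come from the question;
# State=UNKNOWN is an assumed value.
NodeName=ubuntu CPUs=24 Gres=gpu:titanx:2 State=UNKNOWN

# Partition name "debug" is hypothetical. Shared must not be EXCLUSIVE,
# otherwise each job still receives the whole node.
PartitionName=debug Nodes=ubuntu Default=YES Shared=NO State=UP
```

With a configuration along these lines, two jobs started with --gres=gpu:1 should each hold one GPU and only the cores they request, leaving the remaining cores available to CPU-only jobs.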