Problem description
I have access to a cluster that's run by Slurm, in which each node has 4 GPUs.
I have a code that needs 8 GPUs.
So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?
So this is the job that I tried to submit via sbatch:
#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
But then I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
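(Side note: --gres requests resources per node, so this script effectively asks for 8 GPUs on each of the 2 nodes, which no node here can provide. One way to check the GRES each node advertises, with the exact output depending on the cluster's configuration:

sinfo -o "%N %G"
)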
Then I changed the settings to this and submitted again:
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi
and the result shows only 4 GPUs, not 8.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 0000:03:00.0 Off | 0 |
| N/A 32C P0 31W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 0000:04:00.0 Off | 0 |
| N/A 37C P0 29W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 0000:82:00.0 Off | 0 |
| N/A 35C P0 28W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 0000:83:00.0 Off | 0 |
| N/A 33C P0 26W / 250W | 0MiB / 12193MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
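As a side note, the batch script itself runs only on the first allocated node, so an nvidia-smi there can only ever show that node's 4 GPUs. A minimal sketch to check the GPUs on every allocated node (same 2-node request as above, one task per node launched through srun):

#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
# --label prefixes each output line with the task number, so the two nodes' GPU lists are distinguishable
srun --label nvidia-smi -L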
Thanks.
Recommended answer
Slurm does not support what you need. It can only assign GPUs to your job per node, not per cluster. So, unlike CPUs or other consumable resources, GPUs are not consumable and are bound to the node that hosts them.
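In practice, the usual workaround is to request the per-node maximum (here 4 GPUs) on 2 nodes and make the code itself multi-node aware, launching one process per node with srun and letting a distributed framework (MPI, NCCL, torch.distributed, etc.) tie the 2×4 GPUs together. A minimal sketch, where my_distributed_code.py is a hypothetical script that handles the cross-node coordination itself:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
# srun starts one task per node; each task sees only the 4 GPUs of its own node
srun python my_distributed_code.py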
If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper. There you'll find how to do it using GPU virtualization technologies.