This article covers how to request and use GPUs on different nodes of a cluster managed by Slurm.

Problem Description

I have access to a cluster that is managed by Slurm, in which each node has 4 GPUs.

I have a code that needs 8 GPUs.

So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?

So this is the job that I tried to submit via sbatch:

#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00

But then I get the following error:

sbatch: error: Batch job submission failed: Requested node configuration is not available
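
This error means that no single node in the cluster can provide the requested configuration: --gres counts GPUs per node, and none of the nodes has 8. A quick way to confirm what each node actually advertises is to query its GRES (a minimal sketch; <nodename> is a placeholder):

# List partitions, nodes and their generic resources (e.g. gpu:4)
sinfo -o "%P %N %G"

# Or inspect a single node in detail; <nodename> is a placeholder for a real node name
scontrol show node <nodename> | grep -i gres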

Then I changed the settings to this and submitted again:

#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi

And the result shows only 4 GPUs, not 8.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000:03:00.0     Off |                    0 |
| N/A   32C    P0    31W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 0000:04:00.0     Off |                    0 |
| N/A   37C    P0    29W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 0000:82:00.0     Off |                    0 |
| N/A   35C    P0    28W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 0000:83:00.0     Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 12193MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
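
Note that nvidia-smi only reports the GPUs physically attached to the node it runs on, and the batch script itself executes on the first node of the allocation, so it can show at most that node's 4 GPUs. A minimal sketch of a check that runs once per allocated node instead (whether --gres must be repeated on the job step depends on the site's Slurm configuration):

# In the script above, replace the plain nvidia-smi line with:
srun --ntasks-per-node=1 --gres=gpu:4 nvidia-smi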

Thanks.

Recommended Answer

Slurm does not support what you need. It can only assign GPUs to your job per node, not per cluster. So, unlike CPUs or other consumable resources, GPUs are not consumable across nodes and are bound to the node that hosts them.
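
What can be requested is 4 GPUs on each of 2 nodes, i.e. 8 GPUs in total for the job; the code itself then has to span the two nodes, for example via MPI or a distributed training framework. A minimal sketch under that assumption (train.py is a placeholder for the actual 8-GPU code):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4          # 4 GPUs per node, 8 in total across the job
#SBATCH --mem=16000M
#SBATCH --time=0-01:00

# One task per node; each task sees its node's 4 GPUs, and the application
# must handle the cross-node communication itself (e.g. MPI or NCCL).
# train.py is a placeholder for the actual multi-GPU code.
srun python train.py

Requesting the GPUs per node like this is consistent with the answer above: Slurm hands each task the GPUs of its own node, and coordinating work beyond a single node is left to the application.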

If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper. There you will find how to do it using GPU virtualization technologies.

