How to use GPUDirect RDMA with InfiniBand

Question

I have two machines. Each machine has multiple Tesla cards and one InfiniBand card. I want GPUs on different machines to communicate over InfiniBand. Point-to-point unicast would be enough. I want to use GPUDirect RDMA so that I can avoid extra copy operations.

I am aware that Mellanox now provides a driver for its InfiniBand cards, but it doesn't come with a detailed development guide. I am also aware that OpenMPI supports the feature I am asking about. But OpenMPI is too heavyweight for this simple task, and it does not support multiple GPUs in a single process.

I wonder if I could get any help with directly using the driver to do the communication. Code sample, tutorial, anything would be good. Also, I would appreciate it if anyone could help me find the code dealing with this in OpenMPI.

Recommended answer

For GPUDirect RDMA to work, you need the following installed:

Recent NVIDIA CUDA suite installed

All of the above should be installed (in the order listed above), and the relevant modules loaded. After that, you should be able to register memory allocated in GPU video memory for RDMA transactions. Sample code will look like:

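void *gpu_buffer = NULL;
struct ibv_mr *mr;
const size_t size = 64 * 1024;
/* pd is a struct ibv_pd * allocated earlier with ibv_alloc_pd() */
if (cudaMalloc(&gpu_buffer, size) != cudaSuccess)
    return -1; /* allocation on the GPU failed */
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
if (mr == NULL)
    return -1; /* registration of the GPU buffer failed */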

This will create (on a GPUDirect RDMA enabled system) a memory region, with a valid memory key that you can use for RDMA transactions with our HCA.
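Whether a system counts as "GPUDirect RDMA enabled" depends in part on the relevant kernel modules being loaded. The following is a minimal sketch, assuming a Linux host, of how one might check for a module by reading /proc/modules. The helper names line_names_module and module_loaded are hypothetical, and the module names to look for (for example nv_peer_mem or nvidia_peermem) are assumptions, since the answer does not name them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `line` (one line in /proc/modules
 * format) starts with the module name `name` followed by whitespace,
 * and 0 otherwise. */
static int line_names_module(const char *line, const char *name)
{
    size_t len = strlen(name);

    return strncmp(line, name, len) == 0 &&
           (line[len] == ' ' || line[len] == '\t');
}

/* Hypothetical helper: returns 1 if `name` is listed in /proc/modules,
 * 0 if it is not, and -1 if the file cannot be opened. */
static int module_loaded(const char *name)
{
    char line[4096];
    int found = 0;
    FILE *f = fopen("/proc/modules", "r");

    if (f == NULL)
        return -1;
    while (!found && fgets(line, sizeof line, f) != NULL)
        found = line_names_module(line, name);
    fclose(f);
    return found;
}
```

Calling module_loaded("nv_peer_mem") before registering the GPU buffer gives a quick hint about why registration might fail, although a loaded module alone does not guarantee that registration will succeed.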

For more details about using RDMA and InfiniBand verbs in your code, you can refer to this document.
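As one illustration of the verbs-level detail involved: a memory key is only useful to a peer that knows it. For an RDMA read or write, the initiating side needs the remote buffer's address and its rkey, and how these are exchanged is left to the application (a plain TCP socket is a common choice). The sketch below shows one possible wire format for that exchange. The struct remote_buf_info, the 16-byte layout, and the pack/unpack helpers are all hypothetical and are not part of libibverbs.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor of a registered buffer, sent to the peer. */
struct remote_buf_info {
    uint64_t addr;    /* buffer address, e.g. (uintptr_t)gpu_buffer */
    uint32_t rkey;    /* remote key, e.g. mr->rkey after ibv_reg_mr() */
    uint32_t length;  /* buffer size in bytes */
};

#define REMOTE_BUF_WIRE_SIZE 16

/* Serialize the descriptor into 16 bytes in network byte order. */
static void pack_remote_buf(const struct remote_buf_info *info,
                            unsigned char out[REMOTE_BUF_WIRE_SIZE])
{
    uint32_t hi   = htonl((uint32_t)(info->addr >> 32));
    uint32_t lo   = htonl((uint32_t)(info->addr & 0xffffffffu));
    uint32_t rkey = htonl(info->rkey);
    uint32_t len  = htonl(info->length);

    memcpy(out,      &hi,   4);
    memcpy(out + 4,  &lo,   4);
    memcpy(out + 8,  &rkey, 4);
    memcpy(out + 12, &len,  4);
}

/* Inverse of pack_remote_buf: rebuild the descriptor from 16 bytes. */
static void unpack_remote_buf(const unsigned char in[REMOTE_BUF_WIRE_SIZE],
                              struct remote_buf_info *info)
{
    uint32_t hi, lo, rkey, len;

    memcpy(&hi,   in,      4);
    memcpy(&lo,   in + 4,  4);
    memcpy(&rkey, in + 8,  4);
    memcpy(&len,  in + 12, 4);

    info->addr   = ((uint64_t)ntohl(hi) << 32) | ntohl(lo);
    info->rkey   = ntohl(rkey);
    info->length = ntohl(len);
}
```

On the receiving side, the unpacked addr and rkey would typically be placed in the wr.rdma.remote_addr and wr.rdma.rkey fields of a struct ibv_send_wr before calling ibv_post_send. Fixing the byte order on the wire keeps the exchange independent of each host's endianness.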
