Problem description
I have four NVIDIA GTX 1080 graphics cards, and when I initialize a session I see the following console output:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N Y N N
1: Y N N N
2: N N N Y
3: N N Y N
I also have 2 NVIDIA Tesla M60 graphics cards, and their initialization looks like:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N N N N
1: N N N N
2: N N N N
3: N N N N
I also noticed that this output changed for the 1080 GPUs since the last update from 1.6 to 1.8. It looked something like this (I cannot remember precisely, just from memory):
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3 0 1 2 3
0: Y N N N 0: N N Y N
1: N Y N N or 1: N N N Y
2: N N Y N 2: Y N N N
3: N N N Y 3: N Y N N
My questions are:
- What is this device interconnect?
- What influence does it have on computation power?
- Why does it differ for different GPUs?
- Can it change over time due to hardware reasons (failures, driver inconsistency...)?
TL;DR
What is this device interconnect?
As Almog David stated in the comments, this tells you whether one GPU has direct memory access to the other.
What influence does it have on computation power?
The only effect this has is on multi-GPU training: data transfer between two GPUs is faster if they have a device interconnect.
Why does it differ for different GPUs?
This depends on the topology of the hardware setup. A motherboard only has so many PCI-e slots connected to the same bus (check the topology with nvidia-smi topo -m).
Can it change over time due to hardware reasons (failures, driver inconsistency...)?
I don't think the order can change over time, unless NVIDIA changes the default enumeration scheme. There is a little more detail here
Explanation
This message is generated in the BaseGPUDeviceFactory::CreateDevices function. It iterates through each pair of devices in the given order and calls cuDeviceCanAccessPeer. As Almog David mentioned in the comments, this just indicates whether you can perform DMA between the devices.
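For illustration, here is a minimal sketch (my own addition, not part of the original answer) that queries the same driver API directly from Python via ctypes; it assumes libcuda.so.1 is installed and prints a Y/N matrix like the one in the TensorFlow log:
# peer_matrix.py -- sketch: query cuDeviceCanAccessPeer through ctypes
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")
assert cuda.cuInit(0) == 0                                # 0 == CUDA_SUCCESS

count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0

# resolve device ordinals to CUdevice handles
devs = []
for i in range(count.value):
    d = ctypes.c_int()
    assert cuda.cuDeviceGet(ctypes.byref(d), i) == 0
    devs.append(d)

for i, di in enumerate(devs):
    row = []
    for dj in devs:
        can = ctypes.c_int(0)
        # can becomes 1 if device di can DMA into device dj's memory, 0 otherwise
        cuda.cuDeviceCanAccessPeer(ctypes.byref(can), di, dj)
        row.append("Y" if can.value else "N")
    print("%d: %s" % (i, " ".join(row)))
The diagonal comes out as N, matching the TensorFlow log, because a device is not reported as a peer of itself.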
You can perform a little test to check that the order matters. Consider the following snippet:
#test.py
import tensorflow as tf
#allow growth to take up minimal resources
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Now let's check the output with different device orders in CUDA_VISIBLE_DEVICES:
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
...
2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
...
$ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
...
2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N Y
2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N Y N
2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N Y N N
2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y N N N
...
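The same reordering can also be done from inside the script, as long as the environment variable is set before TensorFlow initializes CUDA. A small sketch of that (my own addition, not from the original answer):
# reorder.py -- sketch: set the device order before TensorFlow touches CUDA
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,0,1,3"  # must be set before importing tensorflow

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)  # the peer-access matrix now follows the new order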
You can get a more detailed explanation of the connections by running nvidia-smi topo -m. For example:
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PHB SYS SYS 0-7,16-23
GPU1 PHB X SYS SYS 0-7,16-23
GPU2 SYS SYS X PHB 8-15,24-31
GPU3 SYS SYS PHB X 8-15,24-31
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
I believe the further down the legend you go, the faster the transfer.
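If you want to check this on your own machine, a rough sketch like the one below (my own addition, not part of the original answer) times a copy between two GPUs in TensorFlow 1.x; running it for different GPU pairs should roughly reflect the differences shown by nvidia-smi topo -m:
# copy_timing.py -- rough sketch: time a device-to-device copy (TF 1.x)
import time
import tensorflow as tf

with tf.device('/gpu:0'):
    src = tf.random_normal([4096, 4096])   # ~64 MB tensor generated on GPU 0
with tf.device('/gpu:1'):
    dst = tf.identity(src)                  # forces a copy from GPU 0 to GPU 1

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(dst)                           # warm-up
    start = time.time()
    for _ in range(100):
        sess.run(dst)
    print("avg run time: %.4f s" % ((time.time() - start) / 100))
Note that this measures the whole run (tensor generation plus copy), so it is only meaningful as a relative comparison between GPU pairs, not as an absolute bandwidth figure.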