
Problem description


I have four NVIDIA GTX 1080 graphics cards, and when I'm initializing a session I see the following console output:

Adding visible gpu devices: 0, 1, 2, 3
 Device interconnect StreamExecutor with strength 1 edge matrix:
      0 1 2 3
 0:   N Y N N
 1:   Y N N N
 2:   N N N Y
 3:   N N Y N

I also have 2 NVIDIA M60 Tesla graphics cards, and their initialization looks like:

Adding visible gpu devices: 0, 1, 2, 3
 Device interconnect StreamExecutor with strength 1 edge matrix:
      0 1 2 3
 0:   N N N N
 1:   N N N N
 2:   N N N N
 3:   N N N N

I also noticed that this output changed for the 1080 GPUs after the last update from 1.6 to 1.8. It looked something like this (I cannot remember precisely, just from memory):

 Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3            0 1 2 3
0:   Y N N N         0: N N Y N
1:   N Y N N    or   1: N N N Y
2:   N N Y N         2: Y N N N
3:   N N N Y         3: N Y N N

My questions are:

  • what is this Device interconnect?
  • what influence does it have on computation power?
  • why does it differ for different GPUs?
  • can it change over time due to hardware reasons (failures, driver inconsistency...)?

Solution

TL;DR

What is this Device interconnect?

As stated by Almog David in the comments, this tells you whether one GPU has direct memory access to the other.

What influence does it have on computation power?

The only effect this has is on multi-GPU training. Data transfer is faster if the two GPUs have a device interconnect.

Why does it differ for different GPUs?

This depends on the topology of the hardware setup. A motherboard only has so many PCI-e slots that are connected by the same bus (check the topology with nvidia-smi topo -m).

Can it change over time due to hardware reasons (failures, driver inconsistency...)?

I don't think that the order can change over time, unless NVIDIA changes the default enumeration scheme. There is a little more detail here.
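
If you want the device indices (and hence the rows of the matrix) to stay stable, note that the matrix simply follows the enumeration order. One way to pin that order is the documented CUDA_DEVICE_ORDER environment variable; a small sketch, assuming it is set before anything in the process initializes CUDA:

#pin enumeration to PCI bus order so device indices do not depend on the
#driver's default "fastest first" ordering; must be set before CUDA is initialized
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import tensorflow as tf  #imported only after the variable is set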

Explanation

This message is generated in the BaseGPUDeviceFactory::CreateDevices function. It iterates through each pair of devices in the given order and calls cuDeviceCanAccessPeer. As mentioned by Almog David in the comments, this just indicates whether you can perform DMA between the devices.
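
If you want to reproduce this matrix outside of TensorFlow, you can issue the same driver call yourself. Below is a minimal sketch that goes through ctypes straight to the CUDA driver API; the library name libcuda.so.1 is an assumption for Linux (on Windows it is nvcuda.dll):

#peer_matrix.py
#prints the same Y/N peer-access matrix that TensorFlow logs, by calling
#cuDeviceCanAccessPeer for every pair of visible devices
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")

def check(res):
    #every driver API call returns a CUresult; 0 means CUDA_SUCCESS
    assert res == 0, "CUDA driver call failed with error %d" % res

check(cuda.cuInit(0))

count = ctypes.c_int()
check(cuda.cuDeviceGetCount(ctypes.byref(count)))

#CUdevice is an integer handle obtained from cuDeviceGet
devices = []
for ordinal in range(count.value):
    dev = ctypes.c_int()
    check(cuda.cuDeviceGet(ctypes.byref(dev), ordinal))
    devices.append(dev)

print("   " + " ".join(str(i) for i in range(count.value)))
for i, dev in enumerate(devices):
    row = []
    for peer in devices:
        can = ctypes.c_int(0)
        check(cuda.cuDeviceCanAccessPeer(ctypes.byref(can), dev, peer))
        row.append("Y" if can.value else "N")
    print("%d: %s" % (i, " ".join(row)))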

You can perform a little test to check that the order matters. Consider the following snippet:

#test.py
import tensorflow as tf

#allow growth to take up minimal resources
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)

Now let's check the output with different device orders in CUDA_VISIBLE_DEVICES:

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
...
2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3
2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y N N
2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N N N
2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   N N N Y
2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   N N Y N
...

$ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
...
2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3
2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N N N Y
2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N N Y N
2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   N Y N N
2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y N N N
...

You can get a more detailed explanation of the connections by running nvidia-smi topo -m. For example:

       GPU0      GPU1    GPU2   GPU3    CPU Affinity
GPU0     X       PHB    SYS     SYS     0-7,16-23
GPU1    PHB       X     SYS     SYS     0-7,16-23
GPU2    SYS      SYS     X      PHB     8-15,24-31
GPU3    SYS      SYS    PHB      X      8-15,24-31

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

I believe the lower you go on the list, the faster the transfer.
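
To get a feel for what that difference means in practice, you can time a cross-device copy for a peer pair and for a non-peer pair. The following is only a rough sketch written against the same TF 1.x API as the snippet above; the device names and the ~256 MB buffer size are assumptions based on the GTX 1080 matrix in the question (0 and 1 are peers, 0 and 2 are not), and whether TensorFlow actually issues a direct peer copy depends on the runtime, so treat the numbers as indicative only:

#bandwidth_sketch.py
#rough comparison of copy throughput between a peer-connected GPU pair and a
#non-peer pair; only the relative difference between the two results matters
import time
import tensorflow as tf

def copy_throughput(src, dst, megabytes=256, iters=20):
    tf.reset_default_graph()
    with tf.device(src):
        #~256 MB of float32 living on the source GPU
        x = tf.Variable(tf.random_uniform([megabytes * 1024 * 1024 // 4]))
    with tf.device(dst):
        #reducing x on the destination GPU pulls the whole buffer across on
        #every run, while only a scalar comes back to the host
        y = tf.reduce_sum(x)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(y)  #warm-up
        start = time.time()
        for _ in range(iters):
            sess.run(y)
        return megabytes * iters / (time.time() - start)

print("GPU0 -> GPU1: %.0f MB/s" % copy_throughput("/gpu:0", "/gpu:1"))
print("GPU0 -> GPU2: %.0f MB/s" % copy_throughput("/gpu:0", "/gpu:2"))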

