Basics

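For Docker and most other Linux containers, Cgroups are the main mechanism for imposing constraints, while Namespaces are the main mechanism for changing what a process can see of the system.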

What Docker starts is just a process, nothing more.

Reference:

Isolation (Namespace)

When your code calls clone, passing flags such as CLONE_NEWPID/CLONE_NEWNS/CLONE_NEWUTS/CLONE_NEWNET/CLONE_NEWIPC starts a process that runs in its own, isolated namespaces.
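As a rough illustration of how those flags combine, the sketch below ORs them into the single bitmask that clone receives. The numeric values are the ones defined in Linux's <linux/sched.h>; the helper name clone_flags and the short namespace labels are made up for this example.

```python
# Flag values from <linux/sched.h>; each one requests a new namespace of that kind.
CLONE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts":   0x04000000,  # CLONE_NEWUTS
    "ipc":   0x08000000,  # CLONE_NEWIPC
    "user":  0x10000000,  # CLONE_NEWUSER
    "pid":   0x20000000,  # CLONE_NEWPID
    "net":   0x40000000,  # CLONE_NEWNET
}


def clone_flags(namespaces):
    """OR together the CLONE_NEW* flags for the requested namespace kinds."""
    mask = 0
    for name in namespaces:
        mask |= CLONE_FLAGS[name]
    return mask


if __name__ == "__main__":
    # The five flags named in the text above.
    print(hex(clone_flags(["pid", "mount", "uts", "net", "ipc"])))
```

The resulting integer is what would be passed as the flags argument; actually creating the namespaces requires the appropriate privileges.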

Put simply, a Namespace is an illusion: it narrows what the process is able to see.

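  • PID Namespace
  • Mount: only the mount points in the current Namespace are visible
  • UTS
  • IPC
  • Network: only the network devices in the current Namespace are visible
  • User
  • System (wall-clock) time cannot be namespaced: if the system time is changed inside one container, it changes for the host and for every other container on that host. The time namespace added in Linux 5.6 only offsets the monotonic and boot-time clocks.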
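One way to see which namespaces a process lives in is to read the symbolic links under /proc/<pid>/ns; two processes share a namespace when the inode numbers match. The sketch below assumes a Linux host with /proc mounted, and the function names are made up for this example.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (kind, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: inode} for a process; empty if /proc is unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for entry in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


if __name__ == "__main__":
    for entry, inode in sorted(namespaces_of().items()):
        print(entry, inode)
```

Running it once on the host and once inside a container should print different inode numbers for the namespaces the container does not share with the host.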

Limits (Cgroup)

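Cgroup stands for Linux Control Group. Its main job is to cap the resources that a group of processes can use, including CPU, memory, disk, network bandwidth, and so on.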

The interface that Cgroups expose to users is a file system: they are organized as files and directories under /sys/fs/cgroup.

Specify the limits when starting the container:

docker run -it --cpu-period=100000 --cpu-quota=20000 ubuntu /bin/bash

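Once the container is running, we can confirm the limits by reading the resource-limit files in the "docker" control group of the CPU subsystem in the Cgroups file system (the paths below are the cgroup v1 layout):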

$ cat /sys/fs/cgroup/cpu/docker/5d5c9f67d/cpu.cfs_period_us
100000
$ cat /sys/fs/cgroup/cpu/docker/5d5c9f67d/cpu.cfs_quota_us
20000
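The two numbers read as "this group may run for 20000 µs in every 100000 µs period", that is, 20% of one CPU. A minimal sketch of that arithmetic follows; the function names are made up for this example, and the second helper assumes the cgroup v2 layout, where the same pair lives in a single cpu.max file.

```python
def cpu_limit_v1(quota_us, period_us):
    """CPUs allowed by cpu.cfs_quota_us / cpu.cfs_period_us; None means unlimited."""
    if quota_us < 0:  # the kernel reports -1 when no quota is set
        return None
    return quota_us / period_us


def cpu_limit_v2(cpu_max):
    """Same calculation for cgroup v2 cpu.max, whose content is '<quota> <period>'."""
    quota, period = cpu_max.split()
    if quota == "max":  # "max" means no quota
        return None
    return int(quota) / int(period)


if __name__ == "__main__":
    print(cpu_limit_v1(20000, 100000))
```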

GPU mounting experiments

Using nvidia-docker2

In short, nvidia-docker2 makes GPUs usable with almost no effort; you only need to configure Docker to use the nvidia runtime:

cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"]
}
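A typo in this file tends to surface only when the Docker daemon restarts, so a quick consistency check can be worthwhile. The sketch below only verifies that the default runtime is declared and that every runtime has a path; the function name and the checks are assumptions for illustration, not something Docker ships.

```python
import json


def check_runtime_config(text):
    """Return a list of problems found in a daemon.json document (empty means OK)."""
    config = json.loads(text)
    problems = []
    default = config.get("default-runtime")
    runtimes = config.get("runtimes", {})
    # "runc" is built in; any other default must be declared under "runtimes".
    if default is not None and default != "runc" and default not in runtimes:
        problems.append(f"default-runtime {default!r} is not defined under runtimes")
    for name, spec in runtimes.items():
        if "path" not in spec:
            problems.append(f"runtime {name!r} has no path")
    return problems


if __name__ == "__main__":
    sample = '{"default-runtime": "nvidia", "runtimes": {"nvidia": {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}}}'
    print(check_runtime_config(sample))
```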

After starting a container, running nvidia-smi inside it shows all of the GPU cards:

[root@localhost] docker run -it 98b41a1e975d bash
root@6db1dd28459d:/notebooks# nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   40C    P0    57W / 300W |   4053MiB / 16130MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   38C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:8C:00.0 Off |                    0 |
| N/A   42C    P0    46W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   39C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   39C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:B4:00.0 Off |                    0 |
| N/A   41C    P0    57W / 300W |   7279MiB / 16130MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B5:00.0 Off |                    0 |
| N/A   40C    P0    45W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B6:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

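NVIDIA_DRIVER_CAPABILITIES selects which driver libraries are made available in the container, and NVIDIA_VISIBLE_DEVICES restricts the container to specific GPU cards. For details, see "How to configure resources through environment variables with nvidia-docker".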

[root@mesos-gpu-v100-online020-bdwg cuda-9.0]# docker run -it  --env NVIDIA_DRIVER_CAPABILITIES="compute,utility"  --env NVIDIA_VISIBLE_DEVICES=0,1 98b41a1e975d bash
root@97bf127ff83a:/notebooks# nvidia-smi
Tue Oct 15 09:29:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   39C    P0    57W / 300W |   4053MiB / 16130MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   37C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
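When containers are launched from a script, the same command line can be assembled programmatically. The sketch below builds the argument list for the docker run invocation shown above; the function name and its defaults are made up for this example.

```python
def gpu_run_args(image, devices, capabilities=("compute", "utility"), command=("bash",)):
    """Build the argv for `docker run` with the two NVIDIA_* environment variables set."""
    return [
        "docker", "run", "-it",
        "--env", "NVIDIA_DRIVER_CAPABILITIES=" + ",".join(capabilities),
        "--env", "NVIDIA_VISIBLE_DEVICES=" + ",".join(str(d) for d in devices),
        image, *command,
    ]


if __name__ == "__main__":
    print(" ".join(gpu_run_args("98b41a1e975d", [0, 1])))
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems.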

Using GPUs with plain Docker

Using GPUs with plain Docker turned out to have many pitfalls. First, switch the runtime back to the default:

[root@localhost ~]# cat /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}

After restarting the docker service, try passing a GPU device in directly:

docker run --device /dev/nvidia0:/dev/nvidia0 -it 98b41a1e975d bash

root@a85d5e5f69d9:/notebooks# nvidia-smi
bash: nvidia-smi: command not found

root@a85d5e5f69d9:/notebooks# ll /dev/|grep nvidia
crw-rw-rw-  1 root root 195, 0 Oct 15 06:06 nvidia0

nvidia-smi does not exist in the container, so we can bind-mount the host directory that contains nvidia-smi into it:

[root@mesos-gpu-v100-online020-bdwg cuda-9.0]# docker run --device /dev/nvidia0:/dev/nvidia0 -v /usr/bin/:/usr/bin -it 98b41a1e975d bash
root@cf29b4477304:/notebooks# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

libnvidia-ml.so cannot be found. libnvidia-ml.so is the Nvidia Management Library (NVML for short), and it is part of the Nvidia driver. nvidia-smi manages GPUs by calling into libnvidia-ml.so, so we need to mount it as well:

[root@mesos-gpu-v100-online020-bdwg cuda-9.0]# docker run --device /dev/nvidia0:/dev/nvidia0 -v /usr/bin/:/usr/bin -v /usr/lib64:/usr/lib64 -it 98b41a1e975d bash
root@ee39b2b3b1a4:/notebooks# nvidia-smi
Failed to initialize NVML: Unknown Error
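Whether the library can be located at all can also be checked without nvidia-smi. The sketch below uses Python's ctypes.util.find_library, which approximates the dynamic loader's search; it is only a rough check, and the function name is made up for this example.

```python
import ctypes.util


def nvml_visible():
    """Return the library name resolved for libnvidia-ml, or None if it is not found."""
    return ctypes.util.find_library("nvidia-ml")


if __name__ == "__main__":
    found = nvml_visible()
    print("libnvidia-ml:", found if found else "not found")
```

A result of None corresponds to the "couldn't find libnvidia-ml.so" message; a non-empty result only says the file is present, not that NVML can talk to the driver.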

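Now NVML fails to initialize with "Failed to initialize NVML: Unknown Error". The NVML library talks to the Nvidia driver, so could that communication be blocked? Let's look at which Nvidia kernel modules are loaded and which device nodes exist, and consider whether all of them need to be mapped into the container: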

[root@mesos-gpu-v100-online020-bdwg cuda-9.0]# lsmod|grep nvidia
nvidia_drm             39843  0
nvidia_modeset       1036498  1 nvidia_drm
nvidia_uvm            786729  0
nvidia              16594443  77 nvidia_modeset,nvidia_uvm
ipmi_msghandler        46608  3 ipmi_devintf,nvidia,ipmi_si
drm_kms_helper        163265  2 ast,nvidia_drm
drm                   370825  5 ast,ttm,drm_kms_helper,nvidia_drm
i2c_core               40756  6 ast,drm,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia

[root@mesos-gpu-v100-online020-bdwg cuda-9.0]# ll /dev/|grep nvidia
crw-rw-rw-  1 root root    195,   0 Jul 23 10:56 nvidia0
crw-rw-rw-  1 root root    195,   1 Jul 23 10:56 nvidia1
crw-rw-rw-  1 root root    195,   2 Jul 23 10:56 nvidia2
crw-rw-rw-  1 root root    195,   3 Jul 23 10:56 nvidia3
crw-rw-rw-  1 root root    195,   4 Jul 23 10:56 nvidia4
crw-rw-rw-  1 root root    195,   5 Jul 23 10:56 nvidia5
crw-rw-rw-  1 root root    195,   6 Jul 23 10:56 nvidia6
crw-rw-rw-  1 root root    195,   7 Jul 23 10:56 nvidia7
crw-rw-rw-  1 root root    195, 255 Jul 23 10:56 nvidiactl
crw-rw-rw-  1 root root    195, 254 Jul 23 10:56 nvidia-modeset
crw-rw-rw-  1 root root    237,   0 Jul 23 10:56 nvidia-uvm
crw-rw-rw-  1 root root    237,   1 Jul 23 10:56 nvidia-uvm-tools

Based on the above, we can try again, this time mapping /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia-uvm-tools, and /dev/nvidia-modeset into the container as well:

[root@mesos-gpu-v100-online020-bdwg cuda-9.0]# docker run --device /dev/nvidia0:/dev/nvidia0  --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidia-modeset:/dev/nvidia-modeset -v /usr/bin/:/usr/bin -v /usr/lib64:/usr/lib64 -it 98b41a1e975d bash
root@bc21e395d885:/notebooks# nvidia-smi
Tue Oct 15 09:47:26 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
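The long list of --device flags is easy to get wrong by hand. The sketch below generates it for a chosen set of GPUs plus the shared control nodes used in the command above; the function and constant names are made up for this example.

```python
# Device nodes shared by all GPUs, as mapped in the docker run command above.
CONTROL_NODES = ["nvidiactl", "nvidia-uvm", "nvidia-uvm-tools", "nvidia-modeset"]


def device_args(gpu_indices, dev_dir="/dev"):
    """Build --device flags for the chosen GPUs plus the shared control nodes."""
    nodes = [f"nvidia{i}" for i in gpu_indices] + CONTROL_NODES
    args = []
    for node in nodes:
        path = f"{dev_dir}/{node}"
        args += ["--device", f"{path}:{path}"]
    return args


if __name__ == "__main__":
    print(" ".join(device_args([0])))
```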

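We finally got the output we expected. This exploration left me wondering: how does nvidia-docker do it? Does it also rely on --device plus mounting the Nvidia driver files?

How nvidia-docker works

We first consulted: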

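As we guessed, that is indeed what nvidia-docker does. nvidia-container-runtime wraps runc and registers a pre-start hook that runs before the container's own process starts. The hook calls nvidia-container-cli, which works out which GPU devices, library files, and executables need to be mapped and mounts them into the container, so that the GPU environment is ready by the time the container's process runs.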
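Conceptually, the wrapper's job amounts to adding one entry to the hooks.prestart list of the container's OCI runtime config before handing it to runc. The sketch below shows that transformation on a plain dictionary; the hook path and arguments are assumptions about a typical installation, and the function name is made up for this example.

```python
import copy

# Assumed hook location and arguments; check your own installation before relying on them.
NVIDIA_HOOK = {
    "path": "/usr/bin/nvidia-container-runtime-hook",
    "args": ["nvidia-container-runtime-hook", "prestart"],
}


def add_prestart_hook(oci_config, hook=NVIDIA_HOOK):
    """Return a copy of an OCI runtime config with `hook` appended to hooks.prestart."""
    config = copy.deepcopy(oci_config)
    prestart = config.setdefault("hooks", {}).setdefault("prestart", [])
    if hook not in prestart:  # do not register the same hook twice
        prestart.append(copy.deepcopy(hook))
    return config


if __name__ == "__main__":
    print(add_prestart_hook({"ociVersion": "1.0.0", "process": {"args": ["bash"]}}))
```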

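Installing the Nvidia driver

I ran into many problems during testing. The first was that I was not familiar with the various drivers Nvidia provides and did not know which layer each one belongs to, which made things confusing. Here is what I sorted out.

Nvidia GPU software falls into two categories: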

  • Nvidia driver
  • CUDA Toolkit

Nvidia driver

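Installation methods:

  • Download a file such as NVIDIA-Linux-x86_64-384.59.run and run it directly. After installation, all the files are under /usr/local/nvidia by default, which is why most tutorials use docker -v /usr/local/nvidia:/usr/local/nvidia
  • Alternatively, install from rpm packages: after configuring the repository, run yum install cuda-drivers-410.79-1 (adjust the version to yours). With this method the files go under /usr/bin and /usr/lib64 by default

I unpacked the main rpm packages of the Nvidia driver to see what they contain:

Libraries: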

nvidia-driver-410.79-1.el7.x86_64.rpm               29MB core driver

./usr/lib64/nvidia/xorg/libglxserver_nvidia.so      15M
./usr/lib64/xorg/modules/drivers/nvidia_drv.so      7.5M

nvidia-driver-libs-410.79-1.el7.x86_64.rpm      44MB  core libraries

./etc/ld.so.conf.d/nvidia-x86_64.conf
./usr/lib64/libEGL_nvidia.so.410.79             1008K
./usr/lib64/libGLESv1_CM_nvidia.so.410.79       59K
./usr/lib64/libGLESv2_nvidia.so.410.79          109K
./usr/lib64/libGLX_nvidia.so.410.79             1.3M
./usr/lib64/libnvidia-cbl.so.410.79             363K
./usr/lib64/libnvidia-cfg.so.410.79             176K
./usr/lib64/libnvidia-eglcore.so.410.79         25M
./usr/lib64/libnvidia-glcore.so.410.79          26M
./usr/lib64/libnvidia-glsi.so.410.79            568K
./usr/lib64/libnvidia-glvkspirv.so.410.79       14M
./usr/lib64/libnvidia-rtcore.so.410.79          26M
./usr/lib64/libnvidia-tls.so.410.79             15K
./usr/lib64/libnvoptix.so.410.79                34M
./usr/lib64/vdpau/libvdpau_nvidia.so.410.79     965K
./usr/share/glvnd/egl_vendor.d/10_nvidia.json

nvidia-driver-NVML-410.79-1.el7.x86_64.rpm      560K  Nvidia Management Library

./usr/lib64/libnvidia-ml.so.410.79              1.5M

nvidia-driver-cuda-libs-410.79-1.el7.x86_64.rpm 33M   CUDA driver API libraries

./usr/lib64/libcuda.so.410.79                   15M
./usr/lib64/libnvcuvid.so.410.79                2.7M
./usr/lib64/libnvidia-compiler.so.410.79        46M
./usr/lib64/libnvidia-encode.so.410.79          165K
./usr/lib64/libnvidia-fatbinaryloader.so.410.79 286K
./usr/lib64/libnvidia-opencl.so.410.79          28M
./usr/lib64/libnvidia-ptxjitcompiler.so.410.79  12M

Executables:

nvidia-driver-cuda-410.79-1.el7.x86_64.rpm      394K  MPS and nvidia-smi, the commonly used commands

./usr/bin/nvidia-cuda-mps-control
./usr/bin/nvidia-cuda-mps-server
./usr/bin/nvidia-debugdump
./usr/bin/nvidia-smi

nvidia-modprobe-410.79-1.el7.x86_64.rpm          71K  loads the Nvidia kernel modules and creates the /dev/nvidia* device nodes

./usr/bin/nvidia-modprobe

Rarely used:

nvidia-libXNVCtrl-devel-410.79-1.el7.x86_64      62K  development files for libXNVCtrl (the NV-CONTROL X extension)

./usr/include/NVCtrl
./usr/include/NVCtrl/NVCtrl.h
./usr/include/NVCtrl/NVCtrlLib.h
./usr/include/NVCtrl/nv_control.h
./usr/lib64/libXNVCtrl.so

dkms-nvidia-410.79-1.el7.x86_64.rpm              12M  kernel module source, built through DKMS

Registers the Nvidia kernel module with DKMS, so that the module is rebuilt when the kernel is updated

nvidia-driver-NvFBCOpenGL-410.79-1.el7.x86_64.rpm   135K  NvFBC (framebuffer capture) and NvIFR (inband frame readback) libraries

./usr/lib64/libnvidia-fbc.so.1
./usr/lib64/libnvidia-fbc.so.410.79
./usr/lib64/libnvidia-ifr.so.1
./usr/lib64/libnvidia-ifr.so.410.79

CUDA Toolkit

Installation:

wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sudo sh cuda_10.1.243_418.87.00_linux.run

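After it finishes, the toolkit should be installed under a versioned directory such as /usr/local/cuda-9.0/ (adjust the version; the 10.1 installer above uses /usr/local/cuda-10.1/).

/usr/local/cuda-9.0/lib64/ contains the CUDA toolkit libraries. From the top layer down to the bottom, the full stack is:

  • libcublas.so and libcufft.so belong to the CUDA libraries
  • libcudart.so belongs to the CUDA runtime
  • libcuda.so belongs to the CUDA driver API (part of the Nvidia driver)
  • Nvidia driver, user mode (part of the Nvidia driver)
  • Nvidia driver, kernel mode (part of the Nvidia driver)

Note that /usr/local/cuda-9.0/lib64/stubs contains files such as libcuda.so whose names are identical to the libcuda.so provided by the Nvidia driver, but the libraries under stubs are not the real ones. They are link-time stubs that let CUDA programs be built on a machine without the driver installed, and they must not be loaded at runtime.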
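To keep the layers straight when inspecting a machine, a small classifier can label each library by the layer it belongs to and flag copies that sit under a stubs directory (which are link-time placeholders rather than the driver's real libraries). This is only a sketch; the layer table and function name are made up for this example and cover just the libraries named above.

```python
import posixpath

# Layer names follow the list above; only the libraries mentioned there are covered.
LAYERS = [
    ("CUDA library", ("libcublas", "libcufft")),
    ("CUDA runtime", ("libcudart",)),
    ("CUDA driver API", ("libcuda",)),
]


def classify(path):
    """Name the layer a shared library belongs to, marking stub copies separately."""
    name = posixpath.basename(path)
    in_stubs = "stubs" in path.split("/")
    for layer, prefixes in LAYERS:
        # Match "libcuda.so" or "libcuda.so.<version>", but not "libcudart.so".
        if any(name == p + ".so" or name.startswith(p + ".so.") for p in prefixes):
            return layer + " (stub)" if in_stubs else layer
    return "unknown"


if __name__ == "__main__":
    for lib in ["/usr/lib64/libcuda.so.410.79",
                "/usr/local/cuda-9.0/lib64/stubs/libcuda.so",
                "/usr/local/cuda-9.0/lib64/libcudart.so.9.0"]:
        print(lib, "->", classify(lib))
```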

01-18 20:31