Preface
Yes, this is a rather high-end piece of technology. I'm finally about to do what Lao Cui did back in the day; life is quite legendary.
I. Mainstream GPU Programming Interfaces
1. CUDA
Released by NVIDIA, this is a GPU programming interface designed specifically for NVIDIA cards. Its documentation is very complete, and it works on almost all NVIDIA cards.
All of the GPU programming techniques covered in this column are based on this interface.
2. OpenCL
An open-standard GPU programming interface with the broadest reach; it works on almost all graphics cards.
However, it is somewhat harder to master than CUDA. It is recommended to learn CUDA first; building on that, picking up OpenCL afterwards is very easy.
3. DirectCompute
A GPU programming interface developed by Microsoft. It is very powerful and the easiest of the three to learn, but it only runs on Windows, so it cannot be used on the many high-end servers that run UNIX systems.
In summary, each of these interfaces has its strengths and weaknesses and should be chosen according to the situation. Their usage is very similar, though, so once you have mastered one, learning the other two is easy.
II. Parallel Efficiency
Ref: https://www.cnblogs.com/muchen/p/6134374.html
III. Systematic Courses
There are systematic courses on GPU parallel programming in North America.
CME 213 Introduction to parallel computing using MPI, openMP, and CUDA
Eric Darve, Stanford University
It feels like MPI and OpenMP have largely been displaced by MapReduce; the CUDA part still has some value.
Caltech, Computing + Mathematical Sciences, 2018 course [recommended]
IV. Build Environment
Installation
Driver installation: Installing Ubuntu 16.04 with CUDA 9.0 and cuDNN 7.3 for deep learning [fairly detailed write-up]
Automated script: rava-dosa/ubuntu 16.04 nvidia 940 mx.sh
Reference post: How to Setup Ubuntu 16.04 with CUDA, GPU, and other requirements for Deep Learning
Driver installation: Deep Learning GPU Installation on Ubuntu 18.4 [verified in practice]
Troubleshooting
X server issue: How to install NVIDIA.run?
Recovery steps for the login-loop problem: https://www.jianshu.com/p/34236a9c4a2f
# Remove all NVIDIA packages
sudo apt-get remove --purge nvidia-*
# Reinstall the desktop environment
sudo apt-get install ubuntu-desktop
# Remove the stale X configuration
sudo rm /etc/X11/xorg.conf
# Fall back to the open-source nouveau driver
echo 'nouveau' | sudo tee -a /etc/modules
# Reboot the system
sudo reboot
Driver switching
sudo apt-get install nvidia-cuda-toolkit
cpu one thread   : Time cost: 30.723241 sec, data[100] is -0.207107
gpu full threads : Time cost: 0.107630 sec, data[100] is -0.207107
This test was run on a laptop; even compared with the CPU running all four threads, the GPU still delivers roughly a 70x speedup.
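The benchmark program that produced the numbers above is not included in this post. The sketch below only shows one way such a CPU-versus-GPU timing comparison could be written, using std::chrono for the CPU loop and CUDA events for the kernel; the workload (repeated sinf), the array size, and the iteration count are assumptions and will not reproduce the exact figures.

// Hedged sketch of a CPU-vs-GPU timing comparison (build with: nvcc timing.cu -o timing)
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <chrono>
#include <cuda_runtime.h>

#define N    (1 << 20)   // assumed array size
#define REPS 1000        // assumed amount of per-element work

__global__ void heavy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int r = 0; r < REPS; ++r) x = sinf(x + r);   // artificial heavy math
        data[i] = x;
    }
}

int main() {
    float *h = (float*)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = i * 0.001f;

    // CPU, single thread
    double acc = 0.0;   // keep the result live so the loop is not optimized away
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) {
        float x = h[i];
        for (int r = 0; r < REPS; ++r) x = sinf(x + r);
        acc += x;
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("cpu one thread   : Time cost: %.6f sec (checksum %.3f)\n",
           std::chrono::duration<double>(t1 - t0).count(), acc);

    // GPU, timed with CUDA events
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    heavy_kernel<<<(N + 255) / 256, 256>>>(d, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("gpu full threads : Time cost: %.6f sec\n", ms / 1000.0);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return 0;
}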
V. Machine Learning and GPUs
There does not seem to be one complete, unified solution yet; instead, different algorithms have different libraries providing their own GPU-accelerated versions. In other words, this area is still in a fragmented, "Three Kingdoms" stage.
Programming Patterns
That's right, everything follows a pattern.
I. GPU Computing: Step by Step
• Setup inputs on the host (CPU-accessible memory)
• Allocate memory for outputs on the host
• Allocate memory for inputs on the GPU
• Allocate memory for outputs on the GPU
• Copy inputs from host to GPU
• Start the GPU kernel (the function that executes on the GPU)
• Copy output from GPU to host
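As a concrete illustration of the steps above, here is a minimal CUDA sketch that walks through them with vector addition as a stand-in workload; the kernel, sizes, and launch configuration are illustrative assumptions rather than code from any of the referenced courses.

// Minimal sketch of the step-by-step CUDA workflow (build with: nvcc vec_add.cu -o vec_add)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // 1-2. Set up inputs and allocate outputs on the host
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // 3-4. Allocate memory for inputs and outputs on the GPU
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 5. Copy inputs from host to GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 6. Launch the kernel (the function that executes on the GPU)
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 7. Copy the output from GPU back to the host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);   // expect 300.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}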
Hardware Knowledge
I. A Short Story of NVIDIA Card Development
Link: jcjohnson/cnn-benchmarks
Ref: Build a super fast deep learning machine for under $1,000
Graphics card/GPU
Perhaps the most important attribute to look at for deep learning is the available RAM on the card. If TensorFlow can’t fit the model and the current batch of training data into the GPU’s RAM it will fail over to the CPU—making the GPU pointless.
At the very least, get the CPU side stable first before talking about the GPU.
Another key consideration is the architecture of the graphics card. The last few architectures NVIDIA has put out have been called “Kepler,” “Maxwell,” and “Pascal”—in that order. The difference between the architectures really matters for speed; for example, the Pascal Titan X is twice the speed of a Maxwell Titan X according to this benchmark.
GPUs are critical: The Pascal Titan X with cuDNN is 49x to 74x faster than dual Xeon E5-2630 v3 CPUs.
Most of the papers on machine learning use the TITAN X card, which is fantastic but costs at least $1,000, even for an older version. Most people doing machine learning without infinite budget use the NVIDIA GTX 900 series (Maxwell) or the NVIDIA GTX 1000 series (Pascal).
To figure out the architecture of a card, you can look at the spectacularly confusing naming conventions of NVIDIA: the 9XX cards use the Maxwell architecture while the 10XX cards use the Pascal architecture.
But a 980 card is still probably significantly faster than a 1060 due to higher clock speed and more RAM.
You will have to set different flags for NVIDIA cards based on the architecture of the GPU you get. But the most important thing is any 9XX or 10XX card will be an order of magnitude faster than your laptop.
Don’t be paralyzed by the options; if you haven’t worked with a GPU, they will all be much better than what you have now.
I went with the GeForce GTX 1060 3GB for $195, and it runs models about 20 times faster than my MacBook, but it occasionally runs out of memory for some applications, so I probably should have gotten the GeForce GTX 1060 6GB for an additional $60.
II. Graphics Card Recommendations
Ref: Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
2017-04-09
General GPU Recommendations
Generally, I would recommend the GTX 1080 Ti, GTX 1080 or GTX 1070.
They are all excellent cards and if you have the money for a GTX 1080 Ti you should go ahead with that.
The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X (Maxwell).
The GTX 1080 was a bit less cost efficient than the GTX 1070, but since the GTX 1080 Ti was introduced the price fell significantly and now the GTX 1080 is able to compete with the GTX 1070.
All these three cards should be preferred over the GTX 980 Ti due to their increased memory of 11GB and 8GB (instead of 6GB).
[Ideally the GPU should have more than 6GB of memory]
I personally would go with multiple GTX 1070 or GTX 1080 cards for research. I would rather run a few more experiments that are a bit slower than run just one experiment that is faster.
In NLP the memory constraints are not as tight as in computer vision and so a GTX 1070/GTX 1080 is just fine for me. The tasks I work on and how I run my experiments determines the best choice for me, which is either a GTX 1070 or GTX 1080.
- NVIDIA GeForce GTX 1070
Get a good power supply so it is easy to add more graphics cards later.
16GB of system RAM is recommended.
CPU: Intel i5
III. GPU Mode Switching
From: Using a specified GPU and limiting GPU memory in TensorFlow
(1) Selecting GPUs when launching a program from the terminal
If the machine has multiple GPUs, TensorFlow uses all of them by default.
To use only some of the GPUs, set CUDA_VISIBLE_DEVICES. When launching a Python program, you can use (from Franck Dernoncourt's reply in the link):
CUDA_VISIBLE_DEVICES=1 python my_script.py
Environment Variable Syntax      Results
CUDA_VISIBLE_DEVICES=1           Only device 1 will be seen
CUDA_VISIBLE_DEVICES=0,1         Devices 0 and 1 will be visible
CUDA_VISIBLE_DEVICES="0,1"       Same as above, quotation marks are optional
CUDA_VISIBLE_DEVICES=0,2,3       Devices 0, 2, 3 will be visible; device 1 is masked
CUDA_VISIBLE_DEVICES=""          No GPU will be visible
(2) Selecting GPUs in Python code
To select the GPU from within Python code (for example when debugging with PyCharm), you can use the following (from Yaroslav Bulatov's reply in the link):
import os
# Set this before TensorFlow touches the GPU (e.g., before creating a session)
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # only device 2 will be visible to TensorFlow
(3) Limiting the amount of GPU memory TensorFlow uses
<3.1> Setting a fixed fraction of GPU memory
By default, TensorFlow uses as much GPU memory as it can. You can limit the GPU memory it uses as follows:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)  # allocate 70% of the GPU's physical memory
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
<3.2> Allocating GPU memory on demand
The approach above can only reserve a fixed amount. To allocate memory on demand instead, use the allow_growth option (reference: http://blog.csdn.net/cq361106306/article/details/52950081):
gpu_options = tf.GPUOptions(allow_growth=True)  # grow memory usage as needed instead of reserving it all up front
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
IV. Phone GPU
Android GPU: https://blog.csdn.net/u011723240/article/details/30109763
Training outline: https://blog.csdn.net/PCb4jR/article/details/78890915
Sobel operator comparison; the table of measured results is not reproduced here.
From those results we can see that, on that test platform, the speedup from parallelization becomes more pronounced as the image size grows (and the data processing gets heavier).
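The phone-GPU implementation behind those measurements is not shown in the linked posts. The CUDA sketch below is only meant to illustrate why the Sobel operator parallelizes so well: each output pixel depends on a fixed 3x3 neighborhood, so one thread per pixel needs no synchronization. The image size, synthetic input, and launch configuration are assumptions.

// Hedged sketch: Sobel edge detection with one thread per pixel (build with: nvcc sobel.cu -o sobel)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void sobel(const unsigned char *in, unsigned char *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;   // skip the border

    // Horizontal and vertical 3x3 Sobel gradients
    int gx = -in[(y-1)*w + (x-1)] + in[(y-1)*w + (x+1)]
             - 2*in[y*w + (x-1)] + 2*in[y*w + (x+1)]
             - in[(y+1)*w + (x-1)] + in[(y+1)*w + (x+1)];
    int gy = -in[(y-1)*w + (x-1)] - 2*in[(y-1)*w + x] - in[(y-1)*w + (x+1)]
             + in[(y+1)*w + (x-1)] + 2*in[(y+1)*w + x] + in[(y+1)*w + (x+1)];
    int mag = abs(gx) + abs(gy);                              // cheap gradient magnitude
    out[y*w + x] = (unsigned char)(mag > 255 ? 255 : mag);
}

int main() {
    const int w = 1024, h = 1024;                             // assumed test image size
    size_t bytes = (size_t)w * h;
    unsigned char *h_in  = (unsigned char*)malloc(bytes);
    unsigned char *h_out = (unsigned char*)calloc(bytes, 1);
    for (int i = 0; i < w * h; ++i) h_in[i] = (unsigned char)(i % 256);  // synthetic grayscale image

    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                       // one thread per pixel
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    sobel<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("edge strength at (100, 100): %d\n", h_out[100 * w + 100]);
    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}

Because every pixel is independent, larger images simply mean more threads doing the same fixed amount of work, which is consistent with the observation above that the speedup grows with image size.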
End.