Question
Can I run non-MPI CUDA applications concurrently on NVIDIA Kepler GPUs with MPS? I'd like to do this because my applications cannot fully utilize the GPU, so I want them to co-run together. Is there any code example to do this?
Answer
The necessary instructions are contained in the documentation for the MPS service. You'll note that those instructions don't really depend on or call out MPI, so there really isn't anything MPI-specific about them.
Here's a walkthrough/example.
Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or later for this. There were some configuration differences with prior versions of CUDA MPS that I won't cover here. Also, I'll demonstrate using just a single server/single GPU. The machine I'm using for this test is a CentOS 6.2 node with a K40c (cc3.5/Kepler) GPU and CUDA 7.0. There are other GPUs in the node. In my case, the CUDA enumeration order places my K40c at device 0, but the nvidia-smi enumeration order happens to place it at id 2. All of these details matter in a system with multiple GPUs, and they impact the scripts given below.
I'll create several helper bash scripts and also a test application. For the test application, we'd like something with kernel(s) that can obviously run concurrently with kernels from other instances of the application, and we'd also like something that makes it obvious when those kernels (from separate apps/processes) are running concurrently or not. To meet these needs for demonstration purposes, let's have an app that has a kernel that just runs in a single thread on a single SM, and simply waits for a period of time (we'll use ~5 seconds) before exiting and printing a message. Here's a test app that does that:
$ cat t1034.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define MAX_DELAY 30

#define cudaCheckErrors(msg) \
    do { \
      cudaError_t __err = cudaGetLastError(); \
      if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
      } \
    } while (0)

#define USECPSEC 1000000ULL

// host-side microsecond timer
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#define APPRX_CLKS_PER_SEC 1000000000ULL

// spin a single thread for approximately the requested number of seconds
__global__ void delay_kernel(unsigned seconds){
  unsigned long long dt = clock64();
  while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}

int main(int argc, char *argv[]){
  unsigned delay_t = 5;   // seconds, approximately
  unsigned delay_t_r = 0; // initialized so the range test below is safe with no argument
  if (argc > 1) delay_t_r = atoi(argv[1]);
  if ((delay_t_r > 0) && (delay_t_r < MAX_DELAY)) delay_t = delay_t_r;
  unsigned long long difft = dtime_usec(0);
  delay_kernel<<<1,1>>>(delay_t);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  difft = dtime_usec(difft);
  printf("kernel duration: %fs\n", difft/(float)USECPSEC);
  return 0;
}
$ nvcc -arch=sm_35 -o t1034 t1034.cu
$ ./t1034
kernel duration: 6.528574s
$
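As an aside, the measured ~6.5 seconds is longer than the requested 5 because APPRX_CLKS_PER_SEC assumes a 1 GHz clock, while a K40c core clock is closer to 745 MHz (a typical value I'm assuming here; query yours with nvidia-smi -q -d CLOCK). A back-of-the-envelope check:

```shell
# 5 * 1e9 cycles at an assumed ~745 MHz core clock (hypothetical value
# for a K40c) take noticeably longer than 5 seconds of wall time:
awk 'BEGIN { printf "%.2f s\n", 5 * 1000000000 / 745000000 }'
```

That works out to about 6.71 s, in line with the observed duration.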
We'll use a bash script to start the MPS server:
$ cat start_as_root.bash
#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
$
And a bash script to launch 2 copies of our test app "simultaneously":
$ cat mps_run
#!/bin/bash
./t1034 &
./t1034
$
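If you want to scale the experiment beyond two copies, a slightly more general launcher (a sketch; the APP and N variables are placeholders I'm introducing, not part of the original scripts) backgrounds every copy and waits for all of them:

```shell
# Launch N copies of a workload concurrently and block until all finish.
# APP defaults to the test binary built above; N defaults to 2.
APP="${APP:-./t1034}"
N="${N:-2}"
for i in $(seq 1 "$N"); do
  "$APP" 5 &    # each background copy requests a ~5 second kernel
done
wait            # returns once every background copy has exited
echo "all $N copies finished"
```

With MPS active, all N delay kernels should overlap (up to the hardware and MPS client limits), so total wall time stays close to a single run.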
We could also have a bash script to shut down the server, although it's not needed for this walkthrough:
$ cat stop_as_root.bash
#!/bin/bash
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 2 -c DEFAULT
$
Now when we just launch our test app using the mps_run script above, but without actually enabling the MPS server, we get the expected behavior: one instance of the app takes the expected ~5 seconds, whereas the other instance takes approximately double that (~10 seconds). Since it does not run concurrently with an app from another process, it waits for 5 seconds while the other app/kernel is running, and then spends 5 seconds running its own kernel, for a total of ~10 seconds:
$ ./mps_run
kernel duration: 6.409399s
kernel duration: 12.078304s
$
On the other hand, if we start the MPS server first, and repeat the test:
$ su
Password:
# ./start_as_root.bash
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./mps_run
kernel duration: 6.167079s
kernel duration: 6.263062s
$
we see that both apps take the same amount of time to run, because the kernels are running concurrently, due to MPS.
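The effect is the same one you can observe with any overlapping workload: two tasks that run concurrently take roughly the wall time of one. As a GPU-free shell analogy (not part of the original answer), two overlapping 1-second sleeps complete in about 1 second rather than 2:

```shell
# Two overlapping sleeps: wall time is ~1000 ms, not ~2000 ms, mirroring
# the two ~6 s kernels completing in ~6 s total under MPS.
start=$(date +%s%N)     # nanosecond timestamp (GNU date)
sleep 1 &               # first "task" runs in the background
sleep 1                 # second overlaps it in the foreground
wait
end=$(date +%s%N)
ms=$(( (end - start) / 1000000 ))
if [ "$ms" -lt 1800 ]; then
  echo "overlapped in ${ms} ms"
else
  echo "serialized in ${ms} ms"
fi
```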
You're welcome to experiment as you see fit. If this sequence appears to work correctly for you, but running your own application doesn't seem to give the expected results, one possible reason may be that your app/kernels are not able to run concurrently with other instances of the app/kernels due to the construction of your kernels, not anything to do with MPS. You might want to verify the requirements for concurrent kernels, and/or study the concurrentKernels sample app.
Much of the information here was recycled from the test/work done here, albeit the presentation with separate apps differs from the MPI case presented there.