问题描述
我能否在具有 MPS 的 NVIDIA Kepler GPU 上同时运行非 MPI CUDA 应用程序?我想这样做是因为我的应用程序无法充分利用 GPU,所以我希望它们共同运行.有没有代码示例可以做到这一点?
Can I run non-MPI CUDA applications concurrently on NVIDIA Kepler GPUs with MPS? I'd like to do this because my applications cannot fully utilize the GPU, so I want them to co-run together. Is there any code example to do this?
推荐答案
必要的说明包含在 MPS 服务的文档.您会注意到,这些指令并不真正依赖或调用 MPI,因此它们确实没有任何 MPI 特定的内容.
The necessary instructions are contained in the documentation for the MPS service. You'll note that those instructions don't really depend on or call out MPI, so there really isn't anything MPI-specific about them.
这是一个演练/示例.
阅读上述链接文档的第 2.3 节,了解各种要求和限制.我建议为此使用 CUDA 7、7.5 或更高版本.与以前版本的 CUDA MPS 存在一些配置差异,我不会在这里介绍.此外,我将演示仅使用单个服务器/单个 GPU.我用于测试的机器是使用 K40c (cc3.5/Kepler) GPU 和 CUDA 7.0 的 CentOS 6.2 节点.节点中还有其他 GPU.在我的例子中,CUDA 枚举顺序将我的 K40c 放置在设备 0 上,但 nvidia-smi 枚举顺序恰好将它作为 id 2 放置在顺序中.在具有多个 GPU 的系统中,所有这些细节都很重要,会影响下面给出的脚本.
Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or later for this. There were some configuration differences with prior versions of CUDA MPS that I won't cover here. Also, I'll demonstrate just using a single server/single GPU. The machine I am using for test is a CentOS 6.2 node using a K40c (cc3.5/Kepler) GPU, with CUDA 7.0. There are other GPUs in the node. In my case, the CUDA enumeration order places my K40c at device 0, but the nvidia-smi enumeration order happens to place it as id 2 in the order. All of these details matter in a system with multiple GPUs, impacting the scripts given below.
我将创建几个帮助 bash 脚本以及一个测试应用程序.对于测试应用程序,我们想要一些具有明显可以与来自应用程序的其他实例的内核同时运行的内核的东西,并且我们还想要一些能够在这些内核(来自单独的应用程序/进程)时变得明显的东西是否同时运行.为了满足演示目的的这些需求,让我们有一个应用程序,它的内核只在单个 SM 上的单个线程中运行,并且在退出并打印之前简单地等待一段时间(我们将使用约 5 秒)信息.这是一个执行此操作的测试应用程序:
I'll create several helper bash scripts and also a test application. For the test application, we'd like something with kernel(s) that can obviously run concurrently with kernels from other instances of the application, and we'd also like something that makes it obvious when those kernels (from separate apps/processes) are running concurrently or not. To meet these needs for demonstration purposes, let's have an app that has a kernel that just runs in a single thread on a single SM, and simply waits for a period of time (we'll use ~5 seconds) before exiting and printing a message. Here's a test app that does that:
$ cat t1034.cu
#include <stdio.h>
#include <stdlib.h>
#define MAX_DELAY 30
#define cudaCheckErrors(msg)
do {
cudaError_t __err = cudaGetLastError();
if (__err != cudaSuccess) {
fprintf(stderr, "Fatal error: %s (%s at %s:%d)
",
msg, cudaGetErrorString(__err),
__FILE__, __LINE__);
fprintf(stderr, "*** FAILED - ABORTING
");
exit(1);
}
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define APPRX_CLKS_PER_SEC 1000000000ULL
__global__ void delay_kernel(unsigned seconds){
unsigned long long dt = clock64();
while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}
int main(int argc, char *argv[]){
unsigned delay_t = 5; // seconds, approximately
unsigned delay_t_r;
if (argc > 1) delay_t_r = atoi(argv[1]);
if ((delay_t_r > 0) && (delay_t_r < MAX_DELAY)) delay_t = delay_t_r;
unsigned long long difft = dtime_usec(0);
delay_kernel<<<1,1>>>(delay_t);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
difft = dtime_usec(difft);
printf("kernel duration: %fs
", difft/(float)USECPSEC);
return 0;
}
$ nvcc -arch=sm_35 -o t1034 t1034.cu
$ ./t1034
kernel duration: 6.528574s
$
我们将使用 bash 脚本来启动 MPS 服务器:
We'll use a bash script to start the MPS server:
$ cat start_as_root.bash
#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
$
还有一个 bash 脚本同时"启动我们的测试应用的 2 个副本:
And a bash script to launch 2 copies of our test app "simultaneously":
$ cat mps_run
#!/bin/bash
./t1034 &
./t1034
$
我们也可以有一个 bash 脚本来关闭服务器,尽管本演练不需要它:
We could also have a bash script to shut down the server, although it's not needed for this walkthrough:
$ cat stop_as_root.bash
#!/bin/bash
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 2 -c DEFAULT
$
现在,当我们使用上面的 mps_run
脚本启动我们的测试应用程序,但没有实际启用 MPS 服务器时,我们会得到预期的行为,即应用程序的一个实例采用预期的行为~5 秒,而另一个实例大约需要两倍(约 10 秒),因为它不会与来自另一个进程的应用程序同时运行,它会在另一个应用程序/内核运行时等待 5 秒,然后花费 5 秒运行自己的内核,总共大约 10 秒:
Now when we just launch our test app using the mps_run
script above, but without actually enabling the MPS server, we get the expected behavior that one instance of the app takes the expected ~5 seconds, whereas the other instance takes approximately double that (~10 seconds) because, since it does not run concurrently with an app from another process, it waits for 5 seconds while the other app/kernel is running, and then spends 5 seconds running its own kernel, for a total of ~10 seconds:
$ ./mps_run
kernel duration: 6.409399s
kernel duration: 12.078304s
$
另一方面,如果我们先启动 MPS 服务器,然后重复测试:
On the other hand, if we start the MPS server first, and repeat the test:
$ su
Password:
# ./start_as_root.bash
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./mps_run
kernel duration: 6.167079s
kernel duration: 6.263062s
$
我们看到两个应用程序需要相同的时间来运行,因为内核是同时运行的,这是由于 MPS.
we see that both apps take the same amount of time to run, because the kernels are running concurrently, due to MPS.
欢迎您进行您认为合适的实验.如果此序列似乎对您正常工作,但运行您自己的应用程序似乎没有给出预期的结果,一个可能的原因可能是您的应用程序/内核无法与应用程序/内核的其他实例同时运行,因为与内核的构造无关,与 MPS 无关.您可能想要验证 并发要求内核,和/或研究 concurrentKernels 示例应用程序.
You're welcome to experiment as you see fit. If this sequence appears to work correctly for you, but running your own application doesn't seem to give the expected results, one possible reason may be that your app/kernels are not able to run concurrently with other instances of the app/kernels due to the construction of your kernels, not anything to do with MPS. You might want to verify the requirements for concurrent kernels, and/or study the concurrentKernels sample app.
这里的大部分信息都是从测试/完成的工作中回收的 这里 尽管这里展示的单独应用程序与那里展示的 MPI 案例不同.
Much of the information here was recycled from the test/work done here albeit the presentation here with separate apps is different than the MPI case presented there.
更新:当从多个进程运行内核时,非 MPS 情况下的调度程序行为似乎随着 Pascal 和更新的 GPU 而改变.上述测试结果对于测试的 GPU(例如 Kepler)仍然是正确的,但是在 Pascal 或更新的 GPU 上运行上述测试用例时,在非 MPS 情况下会观察到不同的结果.最新的 MPS 文档中将调度程序描述为时间片"调度程序似乎正在发生的事情是,调度程序不是等待一个进程的内核完成,而是根据一些未发布的规则,选择抢占正在运行的内核,以便它可以从另一个进程切换到另一个内核.这仍然不意味着来自不同进程的内核在 CUDA 文档中该词的传统用法中同时"运行,但上述代码被时间片调度程序(在 Pascal 和更新版本上)欺骗",因为它取决于关于使用 SM 时钟设置内核持续时间.时间分片调度程序加上 SM 时钟的这种用法的组合使这个测试用例看起来并发"运行.但是,如 MPS 文档中所述,当 A 和 B 在非 MPS 情况下来自不同的进程时,来自内核 A 的代码与来自内核 B 的代码不在相同的时钟周期内执行.
UPDATE: The scheduler behavior in the non-MPS case when running kernels from multiple processes appears to have changed with Pascal and newer GPUs. The above test results still are correct for the GPUs tested on (e.g. Kepler), but when running the above test case on a Pascal or newer GPU, different results will be observed in the non-MPS case. The scheduler is described as a "time-sliced" scheduler in the latest MPS doc and what appears to be happening is that rather than wait for a kernel from one process to complete, the scheduler may, according to some unpublished rules, choose to pre-empt a running kernel so that it can switch to another kernel from another process. This still doesn't mean that kernels from separate processes are running "concurrently" in the traditional usage of that word in CUDA documentation, but the above code is "tricked" by the time-sliced scheduler (on Pascal and newer) because it depends on using the SM clock to set kernel duration. The combination of the time-sliced scheduler plus this usage of the SM clock makes this test case appear to run "concurrently". However, as described in the MPS doc, the code from kernel A is not executing in the same clock cycle(s) as the code from kernel B, when A and B originate from separate processes in the non-MPS case.
使用上述通用方法来演示这一点的另一种方法可能是使用由多个循环设置的内核持续时间,而不是通过读取 SM 时钟设置的内核持续时间,如 .在这种情况下必须小心,以避免编译器优化"循环.
An alternative method to demonstrate this using the above general approach might be to use a kernel duration that is set by a number of loops, rather than a kernel duration that is set by reading the SM clock, as described here. Care must be taken in that case to avoid having the loops "optimized out" by the compiler.
这篇关于如何使用 Nvidia 多进程服务 (MPS) 运行多个非 MPI CUDA 应用程序?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!