This post looks at where to place cudaEventRecord and how it interacts with overlapping ops issued to different streams.

Problem description

I have two tasks. Each of them performs a copy to device (D), a kernel run (R), and a copy to host (H). I overlap the copy to device of task 2 (D2) with the kernel run of task 1 (R1). In addition, I overlap the kernel run of task 2 (R2) with the copy to host of task 1 (H1).

I also record the start and stop times of the D, R, and H ops of each task using cudaEventRecord. I have a GeForce GT 555M, CUDA 4.1, and Fedora 16.

I have three scenarios:

Scenario 1: I use one stream per task. I place the start/stop events right before/after each op.

Scenario 2: I use one stream per task. I place the start event of the second of the two overlapping ops before the start of the first one (i.e. I place "start R1" before "start D2", and "start H1" before "start R2").

Scenario 3: I use two streams per task and cudaStreamWaitEvent to synchronize between them. One stream is used for the D and H (copy) ops, the other for the R op. I place the start/stop events right before/after each op.

Scenario 1 fails to overlap the ops (neither D2-R1 nor R2-H1 overlap), whereas Scenarios 2 and 3 succeed. My question is: why does Scenario 1 fail while the other two succeed?

For each scenario I measure the overall time for performing task 1 and task 2. R1 and R2 take 5 ms each, so because Scenario 1 fails to overlap anything, its overall time is 10 ms longer than that of Scenarios 2 and 3.
Here is the pseudocode for the scenarios.

Scenario 1 (FAILS): use stream1 for task1 and stream2 for task2.

```
start overall
start D1 on stream1
D1 on stream1
stop D1 on stream1
start D2 on stream2
D2 on stream2
stop D2 on stream2
start R1 on stream1
R1 on stream1
stop R1 on stream1
start R2 on stream2
R2 on stream2
stop R2 on stream2
start H1 on stream1
H1 on stream1
stop H1 on stream1
start H2 on stream2
H2 on stream2
stop H2 on stream2
stop overall
```

Scenario 2 (SUCCEEDS): use stream1 for task1 and stream2 for task2; move up the start event of the second of the overlapping ops.

```
start overall
start D1 on stream1
D1 on stream1
stop D1 on stream1
start R1 on stream1    // moved up
start D2 on stream2
D2 on stream2
stop D2 on stream2
R1 on stream1
stop R1 on stream1
start H1 on stream1    // moved up
start R2 on stream2
R2 on stream2
stop R2 on stream2
H1 on stream1
stop H1 on stream1
start H2 on stream2
H2 on stream2
stop H2 on stream2
stop overall
```

Scenario 3 (SUCCEEDS): use streams 1 and 3 for task1, and streams 2 and 4 for task2.

```
start overall
start D1 on stream1
D1 on stream1
stop D1 on stream1
start D2 on stream2
D2 on stream2
stop D2 on stream2
start R1 on stream3
R1 on stream3
stop R1 on stream3
start R2 on stream4
R2 on stream4
stop R2 on stream4
start H1 on stream1
H1 on stream1
stop H1 on stream1
start H2 on stream2
H2 on stream2
stop H2 on stream2
stop overall
```

Overall times for all scenarios, in ms:

```
Scenario1 = 39.390240
Scenario2 = 29.190241
Scenario3 = 29.298208
```

I also attach the CUDA code below:

```cpp
#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        C[i] = A[i] + B[N - 1 - i];  // was B[N-i], which reads one past the end of B when i == 0
        C[i] = A[i] + B[i] * 2;
        C[i] = A[i] + B[i] * 3;
        C[i] = A[i] + B[i] * 4;
        C[i] = A[i] + B[i];
    }
}

void overlap()
{
    float *h_A;
    float *d_A, *d_C;
    float *h_A2;
    float *d_A2, *d_C2;
    int N = 10000000;
    size_t size = N * sizeof(float);

    // Pinned host memory, required for cudaMemcpyAsync to be truly asynchronous
    cudaMallocHost((void**)&h_A, size);
    cudaMallocHost((void**)&h_A2, size);

    // Allocate vectors in device memory
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_C, size);
    cudaMalloc((void**)&d_A2, size);
    cudaMalloc((void**)&d_C2, size);

    // Per-op timers are declared but never read back in this listing
    float fTimCpyDev1, fTimKer1, fTimCpyHst1, fTimCpyDev2, fTimKer2, fTimCpyHst2;
    float fTimOverall1, fTimOverall2, fTimOverall3;

    for (int i = 0; i < N; ++i)
    {
        h_A[i] = 1;
        h_A2[i] = 5;
    }

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    cudaStream_t csStream1, csStream2, csStream3, csStream4;
    cudaStreamCreate(&csStream1);
    cudaStreamCreate(&csStream2);
    cudaStreamCreate(&csStream3);
    cudaStreamCreate(&csStream4);

    cudaEvent_t ceEvStart, ceEvStop;
    cudaEventCreate(&ceEvStart);
    cudaEventCreate(&ceEvStop);

    cudaEvent_t ceEvStartCpyDev1, ceEvStopCpyDev1, ceEvStartKer1, ceEvStopKer1,
                ceEvStartCpyHst1, ceEvStopCpyHst1;
    cudaEventCreate(&ceEvStartCpyDev1);
    cudaEventCreate(&ceEvStopCpyDev1);
    cudaEventCreate(&ceEvStartKer1);
    cudaEventCreate(&ceEvStopKer1);
    cudaEventCreate(&ceEvStartCpyHst1);
    cudaEventCreate(&ceEvStopCpyHst1);

    cudaEvent_t ceEvStartCpyDev2, ceEvStopCpyDev2, ceEvStartKer2, ceEvStopKer2,
                ceEvStartCpyHst2, ceEvStopCpyHst2;
    cudaEventCreate(&ceEvStartCpyDev2);
    cudaEventCreate(&ceEvStopCpyDev2);
    cudaEventCreate(&ceEvStartKer2);
    cudaEventCreate(&ceEvStopKer2);
    cudaEventCreate(&ceEvStartCpyHst2);
    cudaEventCreate(&ceEvStopCpyHst2);

    // Scenario1
    cudaDeviceSynchronize();
    cudaEventRecord(ceEvStart, 0);

    cudaEventRecord(ceEvStartCpyDev1, csStream1);
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
    cudaEventRecord(ceEvStopCpyDev1, csStream1);
    cudaEventRecord(ceEvStartCpyDev2, csStream2);
    cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
    cudaEventRecord(ceEvStopCpyDev2, csStream2);

    cudaEventRecord(ceEvStartKer1, csStream1);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
    cudaEventRecord(ceEvStopKer1, csStream1);

    cudaEventRecord(ceEvStartKer2, csStream2);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
    cudaEventRecord(ceEvStopKer2, csStream2);

    cudaEventRecord(ceEvStartCpyHst1, csStream1);
    cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
    cudaEventRecord(ceEvStopCpyHst1, csStream1);

    cudaEventRecord(ceEvStartCpyHst2, csStream2);
    cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
    cudaEventRecord(ceEvStopCpyHst2, csStream2);

    cudaEventRecord(ceEvStop, 0);
    cudaDeviceSynchronize();
    cudaEventElapsedTime(&fTimOverall1, ceEvStart, ceEvStop);
    printf("Scenario1 overall time = %10f\n", fTimOverall1);

    // Scenario2
    cudaDeviceSynchronize();
    cudaEventRecord(ceEvStart, 0);

    cudaEventRecord(ceEvStartCpyDev1, csStream1);
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
    cudaEventRecord(ceEvStopCpyDev1, csStream1);

    cudaEventRecord(ceEvStartKer1, csStream1);  // moved up

    cudaEventRecord(ceEvStartCpyDev2, csStream2);
    cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
    cudaEventRecord(ceEvStopCpyDev2, csStream2);

    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
    cudaEventRecord(ceEvStopKer1, csStream1);

    cudaEventRecord(ceEvStartCpyHst1, csStream1);  // moved up

    cudaEventRecord(ceEvStartKer2, csStream2);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
    cudaEventRecord(ceEvStopKer2, csStream2);

    cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
    cudaEventRecord(ceEvStopCpyHst1, csStream1);

    cudaEventRecord(ceEvStartCpyHst2, csStream2);
    cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
    cudaEventRecord(ceEvStopCpyHst2, csStream2);

    cudaEventRecord(ceEvStop, 0);
    cudaDeviceSynchronize();
    cudaEventElapsedTime(&fTimOverall2, ceEvStart, ceEvStop);
    printf("Scenario2 overall time = %10f\n", fTimOverall2);

    // Scenario3
    cudaDeviceSynchronize();
    cudaEventRecord(ceEvStart, 0);

    cudaEventRecord(ceEvStartCpyDev1, csStream1);
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
    cudaEventRecord(ceEvStopCpyDev1, csStream1);

    cudaEventRecord(ceEvStartCpyDev2, csStream2);
    cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
    cudaEventRecord(ceEvStopCpyDev2, csStream2);

    // Kernels run on streams 3/4, gated on the H2D copies via stream-wait events
    cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);
    cudaEventRecord(ceEvStartKer1, csStream3);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream3>>>(d_A, d_A, d_C, N);
    cudaEventRecord(ceEvStopKer1, csStream3);

    cudaStreamWaitEvent(csStream4, ceEvStopCpyDev2, 0);
    cudaEventRecord(ceEvStartKer2, csStream4);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream4>>>(d_A2, d_A2, d_C2, N);
    cudaEventRecord(ceEvStopKer2, csStream4);

    // D2H copies go back on streams 1/2, gated on the kernels
    cudaStreamWaitEvent(csStream1, ceEvStopKer1, 0);
    cudaEventRecord(ceEvStartCpyHst1, csStream1);
    cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
    cudaEventRecord(ceEvStopCpyHst1, csStream1);

    cudaStreamWaitEvent(csStream2, ceEvStopKer2, 0);
    cudaEventRecord(ceEvStartCpyHst2, csStream2);
    cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
    cudaEventRecord(ceEvStopCpyHst2, csStream2);

    cudaEventRecord(ceEvStop, 0);
    cudaDeviceSynchronize();
    cudaEventElapsedTime(&fTimOverall3, ceEvStart, ceEvStop);
    printf("Scenario3 overall time = %10f\n", fTimOverall3);
    cudaStreamDestroy(csStream1);
    cudaStreamDestroy(csStream2);
    cudaStreamDestroy(csStream3);
    cudaStreamDestroy(csStream4);

    cudaFree(d_A);
    cudaFree(d_C);
    cudaFreeHost(h_A);
    cudaFree(d_A2);
    cudaFree(d_C2);
    cudaFreeHost(h_A2);
}

int main()
{
    overlap();
}
```

Thank you very much for your time in advance!

Solution

(Note: I'm more familiar with the Tesla series devices and don't actually have a GT 555M to experiment with, so my results refer specifically to a C2070. I don't know how many copy engines the 555M has, but I expect the issues described below are what's causing the behavior you are seeing.)

The issue is the lesser-known fact that cudaEventRecord calls are CUDA operations too, and they must also be placed in one of the hardware queues before being launched/executed. (A complicating factor is that, since cudaEventRecord is neither a copy operation nor a compute kernel, it can actually go in any hardware queue. My understanding is that it usually goes in the same hardware queue as the preceding CUDA operation of the same stream, but since this is not specified in the docs, the actual behavior may be device/driver dependent.)

If I may extend your notation to use 'E' for 'event record' and detail how the hardware queues are filled (similar to what is done in the "CUDA C/C++ Streams and Concurrency" webinar), then in your Scenario 1 example the issue order for the CUDA operations is:

```
ED1 D1 ED1 ED2 D2 ED2 ER1 R1 ER1 ...
```

These fill the queues like:

```
Hardware Queues:    copyH2D    Kernel
                    -------    ------
                    ED1 *      R1
                    D1  /      ER1
                    ED1 /      ...
                    ED2 /
                    D2  /
                    ED2 /
                    ER1 *
```

and you can see that R1, by virtue of being in stream 1, will not execute until ER1 has completed, which won't happen until both D1 and D2 have completed, since they are all serialized in the H2D copy queue.

By moving the cudaEventRecord ER1 up in Scenario 2, you avoid this, since all CUDA operations in stream 1 prior to R1 now complete before D2. This permits R1 to launch concurrently with D2:

```
Hardware Queues:    copyH2D    Kernel
                    -------    ------
                    ED1 *      R1
                    D1  /      ER1
                    ED1 /      ...
                    ER1 *
                    ED2
                    D2
                    ED2
```

In your Scenario 3, ER1 is replaced by ER3. As this is the first operation in stream 3, it can go anywhere, and (guessing) it goes in either the kernel or the copy-D2H queue, from which it could be launched immediately (if you didn't have the cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0) for synchronization with stream 1), so it does not cause false serialization with D2:

```
Hardware Queues:    copyH2D    Kernel
                    -------    ------
                    ED1 *      ER3
                    D1  /      R3
                    ED1 *      ER3
                    ED2        ...
                    D2
                    ED2
```

My comments would be:

- The issue order of CUDA operations is very important when considering concurrency.
- cudaEventRecord and similar operations get placed on hardware queues like everything else and can cause false serialization. Exactly how they are placed in the hardware queues is not well described and could be device/driver dependent, so for optimal concurrency the use of cudaEventRecord and similar operations should be reduced to the minimum necessary.
- If kernels need to be timed for performance studies, that can be done using events, but it will break concurrency. This is fine for development but should be avoided in production code.

However, you should note that the upcoming Kepler GK110 (Tesla K20) devices make significant improvements in reducing false serialization by using 32 hardware queues. See the GK110 whitepaper (page 17) for details.

Hope this helps.
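As a follow-up to the comment about keeping cudaEventRecord to a minimum: if all you need is the overall pipeline time, two events at the boundaries are enough, and nothing is left in the queues to falsely serialize. Below is a minimal sketch of that pattern; it reuses VecAdd, the buffers, and the launch configuration from the code above, while the helper name and the stream1/stream2 parameters (standing in for csStream1/csStream2) are mine.

```cpp
// Minimal sketch: time the whole two-task pipeline with exactly two events.
// Per-op events are omitted so they cannot cause false serialization.
float timeOverlappedPipeline(float* h_A, float* h_A2,
                             float* d_A, float* d_C, float* d_A2, float* d_C2,
                             size_t size, int N,
                             int blocksPerGrid, int threadsPerBlock,
                             cudaStream_t stream1, cudaStream_t stream2)
{
    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    cudaDeviceSynchronize();
    cudaEventRecord(evStart, 0);  // legacy default stream: completes after all prior work

    // Issue both tasks breadth-first, with no events in between
    cudaMemcpyAsync(d_A,  h_A,  size, cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, stream2);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_A,  d_A,  d_C,  N);
    VecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream2>>>(d_A2, d_A2, d_C2, N);
    cudaMemcpyAsync(h_A,  d_C,  size, cudaMemcpyDeviceToHost, stream1);
    cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, stream2);

    // Recording on the legacy default stream waits for all preceding work
    cudaEventRecord(evStop, 0);
    cudaEventSynchronize(evStop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, evStart, evStop);
    cudaEventDestroy(evStart);
    cudaEventDestroy(evStop);
    return ms;
}
```

Since the two boundary events have every other operation ahead of them in issue order, they cannot introduce any cross-stream dependence, so the overlap behavior being measured is unchanged by the measurement.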
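The answer also hedges on how many copy engines the GT 555M has. The runtime API can report this: cudaDeviceProp::asyncEngineCount is 1 if a copy can overlap kernel execution, and 2 if H2D and D2H copies can additionally overlap each other. A small standalone sketch (the printout format is my own):

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

// Report each device's copy-engine and kernel-concurrency capabilities.
int main()
{
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    for (int dev = 0; dev < nDevices; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d (%s): asyncEngineCount = %d, concurrentKernels = %d\n",
               dev, prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    }
    return 0;
}
```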