Problem description
On page 85:
int main()
{
    ......
    // run a warmup kernel to remove overhead
    // (seconds() is assumed to return wall-clock time in seconds as a double)
    double iStart, iElaps;
    cudaDeviceSynchronize();
    iStart = seconds();
    warmingup<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("warmup <<< %4d %4d >>> elapsed %f sec \n", grid.x, block.x, iElaps);
    // run kernel 1
    iStart = seconds();
    mathKernel1<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel1 <<< %4d %4d >>> elapsed %f sec \n", grid.x, block.x, iElaps);
    // run kernel 2
    iStart = seconds();
    mathKernel2<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel2 <<< %4d %4d >>> elapsed %f sec \n", grid.x, block.x, iElaps);
    // run kernel 3
    iStart = seconds();
    mathKernel3<<<grid, block>>>(d_C);
    cudaDeviceSynchronize();
    iElaps = seconds() - iStart;
    printf("mathKernel3 <<< %4d %4d >>> elapsed %f sec \n", grid.x, block.x, iElaps);
    ......
}
We can see that a warmup kernel is run before the running times of the other kernels are measured.
From "GPU cards warming up?", I know the reason is:
So if my GPU card has not been idle for long, e.g. I have just used it to run some programs, it should not need any warmup code. Is my understanding right?
Recommended answer
Besides the GPU being in a power-saving state, there can be a number of other reasons why the first launch of a kernel is slower than later runs:
- just-in-time compilation
- transfer of kernel to GPU memory
- cache content
- ...
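To see these one-time costs on a given machine, one option is to time several back-to-back launches of the same kernel and compare the first with the rest. The following is only a minimal sketch: the kernel `touch`, the helper `launch_ms`, and the launch configuration are hypothetical and not taken from the book's example. The size of the gap (if any) depends on the GPU, the driver, and how the binary was compiled.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical trivial kernel; any kernel would do for this comparison.
__global__ void touch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * i;
}

// Wall-clock time of one launch in milliseconds, including the wait
// for the kernel to finish.
static double launch_ms(float *d_out, int blocks, int threads)
{
    auto t0 = std::chrono::steady_clock::now();
    touch<<<blocks, threads>>>(d_out);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int blocks = 64, threads = 256;
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaDeviceSynchronize();  // keep context creation out of the first timing

    // Launch 0 plays the role of the warmup; later launches show the steady state.
    for (int run = 0; run < 5; ++run)
        printf("launch %d: %.3f ms\n", run, launch_ms(d_out, blocks, threads));

    cudaFree(d_out);
    return 0;
}
```

If launch 0 is clearly slower than launches 1 to 4, that difference is the overhead a warmup run is meant to keep out of the measurement.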
For these reasons, if you are interested in the sustained speed that consecutive kernel launches achieve, it is good practice to perform at least one "warmup run" before the timed kernel run.
If, however, you have a specific application and use case in mind, it always makes sense to benchmark that application under the relevant circumstances. Be prepared, though, for much larger variations in runtime in such a less controlled measurement.
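As a rough illustration of both points, the sketch below runs a few untimed warmup launches, then times each of a series of launches with CUDA events and reports the minimum, mean, and maximum. The kernel `scale`, the helper `benchmark`, the `Stats` struct, and the counts (3 warmups, 20 repetitions) are assumptions made for this example, not part of the original code; error checking is mostly left out for brevity.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.5f * data[i] + 1.0f;
}

struct Stats { float min_ms, mean_ms, max_ms; };

// Runs `warmups` untimed launches, then times `reps` launches one by one
// with CUDA events and summarizes the per-launch times (reps must be > 0).
static Stats benchmark(float *d_data, int n, int warmups, int reps)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    for (int i = 0; i < warmups; ++i)
        scale<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();  // make sure the warmup work has finished

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    Stats s = {1e30f, 0.0f, 0.0f};
    for (int i = 0; i < reps; ++i) {
        cudaEventRecord(start);
        scale<<<blocks, threads>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel and stop event are done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        s.min_ms = std::min(s.min_ms, ms);
        s.max_ms = std::max(s.max_ms, ms);
        s.mean_ms += ms / reps;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return s;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    Stats s = benchmark(d_data, n, 3, 20);
    printf("min %.4f ms, mean %.4f ms, max %.4f ms\n", s.min_ms, s.mean_ms, s.max_ms);

    cudaFree(d_data);
    return 0;
}
```

Calling `benchmark` with `warmups` set to 0 gives a feel for how much the first launch skews the numbers; the spread between minimum and maximum hints at how noisy a less controlled measurement may be.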