Why are integer division and modulo not optimized in NVRTC?

Question

I compiled a kernel in NVRTC:

__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}

I know integer division and modulo are very costly on CUDA GPUs. However, I assumed that this kind of division by a power of 2 would be optimized into bit operations, until I compared it against a hand-written version and found that it apparently isn't:

__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}
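For unsigned operands, dividing by 32 matches shifting right by 5, and taking the remainder modulo 32 matches masking with 31, which is why kernel_A and kernel_B should compute the same values. A minimal host-side C++ sketch that checks this for every thread index in a 1024-thread block follows; the function names (warp_by_div, lane_by_mask, forms_agree, and so on) are hypothetical and chosen only for illustration.

```cpp
// Host-side check that the arithmetic and bitwise forms agree.
// All function names here are illustrative, not part of any CUDA API.

constexpr unsigned warp_by_div(unsigned short idx)   { return idx / 32; }
constexpr unsigned lane_by_mod(unsigned short idx)   { return idx % 32; }
constexpr unsigned warp_by_shift(unsigned short idx) { return idx >> 5; }
constexpr unsigned lane_by_mask(unsigned short idx)  { return idx & 31; }

// Spot checks evaluated at compile time.
static_assert(warp_by_div(100) == warp_by_shift(100), "warp id mismatch");
static_assert(lane_by_mod(100) == lane_by_mask(100), "lane id mismatch");

// Returns true if both forms agree for every index in [0, limit).
bool forms_agree(unsigned limit) {
    for (unsigned i = 0; i < limit; ++i) {
        const auto idx = static_cast<unsigned short>(i);
        if (warp_by_div(idx) != warp_by_shift(idx)) return false;
        if (lane_by_mod(idx) != lane_by_mask(idx)) return false;
    }
    return true;
}
```

Calling forms_agree(1024) covers every value threadIdx.x can take in a block of size 1024. The equivalence holds here because the operands are unsigned; signed division rounds toward zero, so it would not match a plain shift for negative values.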

It seems kernel_B simply runs faster. With all other code in the kernel omitted and a launch of 1024 blocks of size 1024, nvprof shows kernel_A taking 15.2 us on average, while kernel_B takes 7.4 us on average. I suspect NVRTC did not optimize away the integer division and modulo.

The results were obtained on a GeForce 750 Ti with CUDA 8.0, averaged over 100 calls. The compiler option given to nvrtcCompileProgram() is -arch compute_50.

Is this expected?

Recommended answer

I did a thorough bug sweep of the codebase. It turns out my app was built in DEBUG mode, which causes the additional flags -G and -lineinfo to be passed to nvrtcCompileProgram().

From the nvcc man page, describing -G:

Generate debug information for device code. Turns off all optimizations. Don't use for profiling; use -lineinfo instead.
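
One way to keep -G out of profiling runs is to build the NVRTC option list from an explicit build mode. The following is a minimal sketch under stated assumptions: BuildMode, build_nvrtc_options and as_c_strings are hypothetical names, and "--gpu-architecture=compute_50" is used as the long form of the -arch option mentioned in the question. Only -G and -lineinfo come from the answer itself.

```cpp
#include <string>
#include <vector>

// Hypothetical build modes for the host application.
enum class BuildMode { Release, Profile, Debug };

// Assemble the option strings for one compilation.
// -G disables device-code optimization, so it is added only for Debug.
std::vector<std::string> build_nvrtc_options(BuildMode mode) {
    std::vector<std::string> opts{"--gpu-architecture=compute_50"};
    switch (mode) {
        case BuildMode::Debug:
            opts.push_back("-G");         // debug info, optimizations off
            break;
        case BuildMode::Profile:
            opts.push_back("-lineinfo");  // line info, optimizations stay on
            break;
        case BuildMode::Release:
            break;                        // no extra flags
    }
    return opts;
}

// Convert to the const char* array form that nvrtcCompileProgram() takes.
// The returned pointers are valid only while `opts` is alive and unmodified.
std::vector<const char*> as_c_strings(const std::vector<std::string>& opts) {
    std::vector<const char*> ptrs;
    ptrs.reserve(opts.size());
    for (const auto& o : opts) ptrs.push_back(o.c_str());
    return ptrs;
}

// Intended use, assuming `prog` is an existing nvrtcProgram:
//   auto opts = build_nvrtc_options(BuildMode::Profile);
//   auto ptrs = as_c_strings(opts);
//   nvrtcCompileProgram(prog, static_cast<int>(ptrs.size()), ptrs.data());
```

With this layout, timing a kernel under nvprof would use the Profile or Release mode, so the division and modulo by 32 are compiled with optimizations enabled.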
