


Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:

__global__ void kernel(float* delt, float* deltb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a;
    a = delt[i] + deltb[i];
    a += 1;
}

and the kernel call looks something like below, with multiple threads and blocks:

int threads = 200;
uint3 blocks = make_uint3(200,1,1);
kernel<<<blocks,threads>>>(d_delt, d_deltb);

None of the above. The CUDA compiler is aggressive enough with optimisations that it can detect that the value of a is never used (it is computed but never written back to memory), so the complete kernel body can be optimised away. You can confirm this by compiling the kernel with the -Xptxas=-v option and looking at the reported resource usage, which should show essentially no registers and no local memory or heap.
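To see the difference for yourself, here is a minimal sketch of two kernels that differ only in whether the result is written back. kernel_dead mirrors the kernel in the question; kernel_live and its extra out parameter are hypothetical names added purely for illustration.

```cuda
#include <cuda_runtime.h>

// Same body as the kernel in the question: 'a' is computed but never stored,
// so the compiler is free to eliminate the loads and the arithmetic.
__global__ void kernel_dead(float* delt, float* deltb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a;
    a = delt[i] + deltb[i];
    a += 1;
}

// Hypothetical variant: writing 'a' to global memory makes the result
// observable, so the computation has to be kept.
__global__ void kernel_live(float* delt, float* deltb, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a;
    a = delt[i] + deltb[i];
    a += 1;
    out[i] = a;
}
```

Compiling this file with nvcc -Xptxas=-v -c should report fewer registers for kernel_dead than for kernel_live, although the exact counts depend on the toolkit version and the target architecture.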

In a less trivial example, a would probably be stored in a per-thread register, or in per-thread local memory, which is off-die DRAM. Either way, every thread gets its own private copy of a.
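As a rough illustration of when local memory comes into play, the sketch below uses a per-thread array that is indexed with a value only known at run time. The kernel name, the scratch array, and the idx parameter are all hypothetical. Scalars such as i will normally live in registers, while an array indexed like this usually cannot be kept in registers and is typically placed in local memory, although the final decision is up to the compiler.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: 'i' is a scalar and will normally be held in a register.
// 'scratch' is a per-thread array indexed with a run-time value, which the
// compiler typically places in per-thread local memory.
__global__ void kernel_local(const float* delt, const int* idx, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float scratch[64];
    for (int k = 0; k < 64; ++k) {
        scratch[k] = delt[i] + k;
    }
    // Mask the index so it always stays within the 64-element array.
    out[i] = scratch[idx[i] & 63];
}
```

With -Xptxas=-v, local memory use of this kind usually shows up as a non-zero stack frame size in the ptxas output for the kernel.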


