


I have written an application in CUDA that uses 1 KB of shared memory per block. Since each SM has only 16 KB of shared memory, only 16 blocks can be accommodated overall (am I understanding this correctly?), although only 8 can be scheduled at a time. Now, if some block is busy with a memory operation, another block will be scheduled on the GPU, but all the shared memory is already used by the 16 blocks that were scheduled there. So will CUDA not schedule more blocks on the same SM until the previously allocated blocks have completely finished? Or will it move some block's shared memory to global memory and allocate another block there (in which case, should we worry about global memory access latency)?



It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:

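  1. 8 blocks

  2. The number of blocks whose combined static and dynamically allocated shared memory does not exceed 16 KB or 48 KB, depending on GPU architecture and settings. There is also a shared memory page size limitation, which means each per-block allocation is rounded up to the next multiple of the page size.

  3. The number of blocks whose combined per-block register usage does not exceed 8192/16384/32768, depending on architecture. There is also a register file page size, which means each per-block allocation is rounded up to the next multiple of the page size.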

That is all there is to it. There is no "paging" of shared memory to accommodate more blocks. NVIDIA produces a spreadsheet for computing occupancy, which ships with the toolkit and is also available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
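To make the "minimum of the three limits" rule concrete, here is a rough host-side sketch in plain C++. It is not NVIDIA's calculator: the `SmLimits` struct, the function names, and the granularity ("page size") values are assumptions chosen for illustration, and real hardware has further limits (for example, resident threads per SM) that are not modeled here.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical description of one SM. All values are supplied by the caller;
// nothing here is queried from a real device.
struct SmLimits {
    std::size_t max_blocks;            // hard cap on resident blocks (8 in the answer above)
    std::size_t shared_bytes;          // shared memory per SM, in bytes
    std::size_t shared_granularity;    // shared memory "page size" (assumed, must be > 0)
    std::size_t registers;             // registers per SM
    std::size_t register_granularity;  // register file "page size" (assumed, must be > 0)
};

// Round value up to the next multiple of granularity.
std::size_t round_up(std::size_t value, std::size_t granularity) {
    return (value + granularity - 1) / granularity * granularity;
}

// Resident blocks per SM = minimum of the block cap, the shared memory limit,
// and the register limit, with per-block allocations rounded up first.
std::size_t resident_blocks_per_sm(const SmLimits& sm,
                                   std::size_t shared_per_block,
                                   std::size_t registers_per_block) {
    std::size_t blocks = sm.max_blocks;
    const std::size_t shared = round_up(shared_per_block, sm.shared_granularity);
    if (shared > 0) blocks = std::min(blocks, sm.shared_bytes / shared);
    const std::size_t regs = round_up(registers_per_block, sm.register_granularity);
    if (regs > 0) blocks = std::min(blocks, sm.registers / regs);
    return blocks;
}

// Illustrative numbers only: 8-block cap, 16 KB shared memory, 8192 registers.
// The 512-byte and 256-register granularities are assumptions.
const SmLimits kExampleSm{8, 16 * 1024, 512, 8192, 256};
```

With these assumed numbers, the case in the question (1 KB of shared memory per block, and say 1024 registers per block) gives `resident_blocks_per_sm(kExampleSm, 1024, 1024) == 8`: shared memory alone would allow 16 blocks, but the 8-block cap wins, and the remaining blocks simply wait until a resident block finishes. On newer toolkits the runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the real figure for a given kernel, which is preferable to hand calculation.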


07-18 21:34