问题描述
我无法使用超过 48K 的共享内存(在 V100、Cuda 10.2 上)
I am unable to use more than 48K of shared memory (on V100, Cuda 10.2)
我打电话
cudaFuncSetAttribute(my_kernel,
cudaFuncAttributePreferredSharedMemoryCarveout,
cudaSharedmemCarveoutMaxShared);
在第一次启动 my_kernel
之前.
before launching my_kernel
first time.
我使用启动边界和 my_kernel
内的动态共享内存:
I use launch boundsand dynamic shared memory inside my_kernel
:
__global__
void __launch_bounds__(768, 1)
my_kernel(...)
{
extern __shared__ float2 sh[];
...
}
内核是这样调用的:
dim3 blk(32, 24); // 768 threads as in launch_bounds.
my_kernel<<<grd, blk, 64 * 1024, my_stream>>>( ... );
cudaGetLastError()
内核调用后返回cudaErrorInvalidValue
.
如果我使用 <= 48 K 的共享内存(例如,my_kernel<),它可以工作.
If I use <= 48 K of shared memory (e.g., my_kernel<<<grd, blk, 48 * 1024, my_stream>>>
), it works.
编译标志是:
nvcc -std=c++14 -gencode arch=compute_70,code=sm_70 -Xptxas -v,-dlcm=cg
我错过了什么?
推荐答案
来自 这里:
计算能力 7.x 设备允许单个线程块来处理共享内存的全部容量:Volta 上为 96 KB,图灵上为 64 KB.依赖于每个块超过 48 KB 的共享内存分配的内核是特定于架构的,因此它们必须使用动态共享内存(而不是静态大小的数组)并且需要使用 cudaFuncSetAttribute 显式选择加入()如下:
cudaFuncSetAttribute(my_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 98304);
当我将该行添加到您显示的代码中时,无效值错误就会消失.对于图灵设备,您可能希望将该数字从 98304 更改为 65536.当然 65536 对于您的示例也足够了,但不足以使用 volta 上可用的最大值,如问题标题中所述.
When I add that line to the code you have shown, the invalid value error goes away. For a Turing device, you would want to change that number from 98304 to 65536. And of course 65536 would be sufficient for your example as well, although not sufficient to use the maximum available on volta, as stated in the question title.
在 类似Ampere 设备上的时尚 内核应该能够使用高达 160KB 的共享内存 (cc 8.0) 或 100KB (cc 8.6),使用上述选择加入机制动态分配,数字 98304 更改为 163840 (例如,对于 cc 8.0)或 102400(对于 cc 8.6).
In a similar fashion kernels on Ampere devices should be able to use up to 160KB of shared memory (cc 8.0) or 100KB (cc 8.6), dynamically allocated, using the above opt-in mechanism, with the number 98304 changed to 163840 (for cc 8.0, for example) or 102400 (for cc 8.6).
请注意,以上内容涵盖了 Volta (7.0) Turing (7.5) 和 Ampere (8.x) 情况.在 7.x 之前具有计算能力的 GPU 无法处理每个线程块超过 48KB 的容量.在某些情况下,这些 GPU 的每个多处理器可能有更多的共享内存,但这是为了在某些线程块配置中允许更大的占用.程序员无法使用每个线程块超过 48KB.
Note that the above covers the Volta (7.0) Turing (7.5) and Ampere (8.x) cases. GPUs with compute capability prior to 7.x have no ability to address more than 48KB per threadblock. In some cases, these GPUs may have more shared memory per multiprocessor, but this is provided to allow for greater occupancy in certain threadblock configurations. The programmer has no ability to use more than 48KB per threadblock.
虽然它与此处提供的代码无关(它已经使用动态共享内存分配),但请注意摘录文档引用中的内容,即在支持它的设备上使用超过 48KB 的共享内存需要两件事:
Although it doesn't pertain to the code presented here (which is already using a dynamic shared memory allocation), note from the excerpted documentation quote that using more than 48KB of shared memory on devices that support it requires 2 things:
- 上面已经描述的选择加入机制
- 动态而不是静态共享内存分配在内核代码中.
- The opt-in mechanism already described above
- A dynamic rather than static shared memory allocation in the kernel code.
动态示例:
extern __shared__ int shared_mem[];
静态示例:
__shared__ int shared_mem[1024];
动态分配的共享内存还需要在内核启动配置参数中传递一个大小(问题中给出了一个示例).
Dynamically allocated shared memory also requires a size to be passed in the kernel launch configuration parameters (an example is given in the question).
这篇关于在 Cuda 中使用最大共享内存的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!