Question
I am using a Tesla C2050, which has compute capability 2.0 and 48 KB of shared memory. But when I try to use this shared memory, the nvcc compiler gives me the following error:
Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max)
My SAT1 is a naive implementation of the scan algorithm, and because I am operating on images of the order of 4096x2160, I have to use double to calculate the cumulative sum. Although the Tesla C2050 does not support double, it nevertheless does the task by demoting it to float. For an image width of 4096, the shared memory size comes out to be greater than 16 KB, but it is well within the 48 KB limit.
Can anybody help me understand what is happening here? I am using CUDA toolkit 3.0.
Recommended answer
By default, Fermi cards run in a compatibility mode, with 16 KB of shared memory and 48 KB of L1 cache per multiprocessor. If you require it, the API call cudaThreadSetCacheConfig can be used to change the GPU to run with 48 KB of shared memory and 16 KB of L1 cache. You must then compile the code for compute capability 2.0 to avoid the code generation error you are seeing.
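As a rough illustration of the cache configuration call, here is a minimal host-side sketch. It assumes a CUDA toolkit of version 4.0 or later, where cudaDeviceSetCacheConfig is the newer name for the cudaThreadSetCacheConfig call mentioned above, and it assumes the card of interest is device 0. It is only a sketch of how the preference could be requested and read back, not the asker's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Assumption: the GPU of interest is device 0.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d, %zu bytes of shared memory per block\n",
           prop.name, prop.major, prop.minor, prop.sharedMemPerBlock);

    // Ask the runtime to prefer the larger shared memory split.
    // This is a preference; the runtime may pick another configuration.
    err = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetCacheConfig: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the preference back to see what the runtime reports.
    cudaFuncCache cfg;
    err = cudaDeviceGetCacheConfig(&cfg);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetCacheConfig: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Shared memory preferred: %s\n",
           cfg == cudaFuncCachePreferShared ? "yes" : "no");
    return 0;
}
```

If only one kernel needs the larger shared memory split, cudaFuncSetCacheConfig can be used to set the preference for that kernel alone instead of for the whole device.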
Also, your Tesla C2050 does support double precision. If you are getting compiler warnings about demoting doubles, it means you are not compiling your code for the correct architecture. Add
--arch=sm_20
to your nvcc arguments and the GPU toolchain will compile for your Fermi card, and will include double precision support and other Fermi-specific hardware features, including the larger shared memory size.
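To make the numbers in the error message concrete: a row of 4096 doubles takes 4096 * 8 = 32768 bytes (0x8000), which is consistent with the 0x8020 bytes reported, and the 0x4000 maximum is 16384 bytes, i.e. 16 KB. The sketch below is a hypothetical stand-in, not the asker's SAT kernel: the names rowPrefixSum and WIDTH are invented for illustration, and the prefix sum is done serially by one thread to keep the example short. It only aims to show a kernel that statically allocates 32 KB of double-precision shared memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Build (toolkits of the Fermi era):  nvcc -arch=sm_20 example.cu
// Newer toolkits no longer accept sm_20; use the architecture of your GPU.

#define WIDTH 4096  // image width from the question

// Hypothetical kernel: each block stages one image row in shared memory
// and writes its running (cumulative) sum in double precision.
__global__ void rowPrefixSum(const unsigned char *in, double *out)
{
    __shared__ double row[WIDTH];  // 4096 * 8 = 32768 bytes (0x8000)

    const unsigned char *src = in + (size_t)blockIdx.x * WIDTH;
    double *dst = out + (size_t)blockIdx.x * WIDTH;

    // Cooperative load of the row into shared memory.
    for (int i = threadIdx.x; i < WIDTH; i += blockDim.x)
        row[i] = src[i];
    __syncthreads();

    // Serial cumulative sum by thread 0 (deliberately simple, not a parallel scan).
    if (threadIdx.x == 0) {
        double sum = 0.0;
        for (int i = 0; i < WIDTH; ++i) {
            sum += row[i];
            dst[i] = sum;
        }
    }
}

int main()
{
    unsigned char *hIn = (unsigned char *)malloc(WIDTH);
    double *hOut = (double *)malloc(WIDTH * sizeof(double));
    for (int i = 0; i < WIDTH; ++i)
        hIn[i] = 1;  // a row of ones, so the final sum should equal WIDTH

    unsigned char *dIn = NULL;
    double *dOut = NULL;
    cudaMalloc(&dIn, WIDTH);
    cudaMalloc(&dOut, WIDTH * sizeof(double));
    cudaMemcpy(dIn, hIn, WIDTH, cudaMemcpyHostToDevice);

    rowPrefixSum<<<1, 256>>>(dIn, dOut);  // one block, one row
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hOut, dOut, WIDTH * sizeof(double), cudaMemcpyDeviceToHost);
    printf("%s\n", hOut[WIDTH - 1] == (double)WIDTH ? "OK" : "MISMATCH");

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return 0;
}
```

Compiled for an architecture with a 16 KB shared memory limit, a kernel like this would be expected to trigger the same "uses too much shared data" error, since the array alone is twice that limit.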