Problem description
I have a CUDA program in which the threads of a block read elements of a long array over several iterations, and the memory accesses are almost fully coalesced. When I profile it, Global Load Efficiency is over 100% (between 119% and 187%, depending on the input). The description of Global Load Efficiency is "Ratio of global memory load throughput to required global memory load throughput." Does this mean that I am hitting the L2 cache a lot and that my memory accesses are benefiting from it?
My GPU is a GeForce GTX 780 (Kepler architecture).
Recommended answer
I asked this question on the NVIDIA forum. Here is the answer I received:
"Global Load Efficiency and Global Store Efficiency describe how well the coalescing of DRAM accesses and (L2?) cache accesses works. If they are 100 percent, you have perfect coalescing. Since efficiencies above 100 percent don't make any sense (you cannot be better than optimal), this has to be an error. The error is caused by the Visual Profiler, which counts hardware events to calculate some abstract metrics. But the GPU doesn't have the "correct" events to calculate all those metrics exactly, so the Visual Profiler has to estimate them using some complex formula and "wrong" events. Some metrics are just rough estimates, and Global Load Efficiency and Global Store Efficiency are two of them. Thus, if such an efficiency is greater than 100 percent, it is an estimation error. As far as I have observed, Global Load Efficiency and Global Store Efficiency both rose above 100 percent in some of my register-spilling kernels. That's why I assume the Visual Profiler uses some events that may also be triggered by local memory accesses to calculate those two efficiencies. Furthermore, GPUs use only 32-bit counters. Long-running kernels therefore tend to overflow those counters, which also causes the Visual Profiler to display wrong metrics."
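To make the counter-overflow point in the quoted answer concrete, here is a minimal host-side C++ sketch. It assumes a simplified model in which the efficiency is the number of bytes requested by the threads divided by the number of bytes actually transferred; the function names `efficiency_percent` and `wrap32` are hypothetical, and this is not the formula the Visual Profiler actually uses. It only illustrates how a wrapped 32-bit count in the denominator can push a ratio above 100%.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (an assumption, not the profiler's real formula):
// efficiency = bytes requested by threads / bytes actually transferred.
double efficiency_percent(std::uint64_t requested_bytes,
                          std::uint64_t transferred_bytes) {
    assert(transferred_bytes > 0);  // avoid division by zero
    return 100.0 * static_cast<double>(requested_bytes) /
           static_cast<double>(transferred_bytes);
}

// Simulates a 32-bit hardware counter: the true count is kept modulo 2^32.
std::uint32_t wrap32(std::uint64_t true_count) {
    return static_cast<std::uint32_t>(true_count);
}
```

For example, with 3,000,000,000 bytes requested and 5,000,000,000 bytes transferred, `efficiency_percent` on the true counts gives 60%. If both counts pass through `wrap32` first, the transferred count exceeds 2^32 and wraps to a much smaller number while the requested count still fits, so the computed ratio ends up far above 100%.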