I want a measure of how much of the peak memory bandwidth my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:
for (int k = 0; k < 4000; k++) {
    // squared difference between the input value and the k-th reference value
    float result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
}
For each thread I count 4000*2+2 accesses to global memory. With 1,000,000 threads and every access being a 4-byte float, that is ~32 GB of global memory traffic (reads and writes combined). Since my kernel takes only 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there is an error in my calculations or assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:
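- What is my error?
- Which accesses to global memory are cached and which are not?
- Is it correct that I don't count accesses to registers, local, shared and constant memory?
- Can I use the CUDA profiler to get easier and more accurate results? Which counters would I need to use, and how would I need to interpret them?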
Profiler output (gputime and cputime in microseconds, memtransfer in bytes):
method           gputime  cputime  occupancy  instruction  warp_serial  memtransfer
memcpyHtoD        10.944       17                                             16384
fill               64.32       93          1        14556            0
fill              64.224       83          1        14556            0
memcpyHtoD        10.656       11                                             16384
fill              64.064       82          1        14556            0
memcpyHtoD       1172.96     1309                                           4194304
memcpyHtoD        10.688       12                                             16384
cu_more_regT   93223.906    93241          1     40716656            0
memcpyDtoH      1276.672     1974                                           4194304
memcpyDtoH      1291.072     2019                                           4194304
memcpyDtoH       1278.72     2003                                           4194304
memcpyDtoH          1840     3172                                           4194304
New question: 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB, and gpu_time ~= 0.1 s, so I achieve a bandwidth of 10*40MB/s = 400MB/s. That seems very low. Where is the error?
p.s. Tell me if you need other counters for your answer.
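- You do not really have 1,000,000 threads running at once. You do perform ~32 GB of global memory accesses, but the bandwidth is determined by the threads currently running (reading) on the SMs and by the size of the data they read.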
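- All accesses to global memory are cached in L1 and L2 unless you tell the compiler to treat the data as uncached. Note that this holds for devices of compute capability 2.0 and later; the Tesla C1060 is a compute capability 1.3 device and has no L1/L2 cache for global memory.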
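- I think so. The achieved bandwidth relates to global memory.
- I recommend using the Visual Profiler to look at the global memory read/write bandwidth. It would be interesting if you posted your results :).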
The default counters in the Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed...).
Regarding your question, to calculate the achieved global memory throughput:
Compute Visual Profiler, DU-05162-001_v02, October 2010, User Guide, page 56, Table 7, Supported Derived Statistics:
Global memory read throughput in gigabytes per second.
For compute capability < 2.0 this is calculated as (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime.
For compute capability >= 2.0 this is calculated as ((DRAM reads) * 32) / gputime.