How to calculate the achieved bandwidth of a CUDA kernel


Question

I want to measure how much of the peak memory bandwidth my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:

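    ...
    float result = 0.0f;  // declared outside the loop so it is still in scope below
    for (int k = 0; k < 4000; k++) {  // was k > 4000, which never enters the loop
        // as counted in the question: two reads of in_data[index] per iteration
        result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
        ....
    }
    out_data[index] = result;         // global write
    out_data2[index] = sqrt(result);  // global write
    ...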

For each thread I count 4000*2+2 accesses to global memory. With 1,000,000 threads and every access being a 4-byte float, that comes to ~32 GB of global memory traffic (reads and writes added together). As my kernel only takes 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there is an error in my calculations or assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:

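What is my error?
Which accesses to global memory are cached and which are not?
Is it correct that I don't count accesses to registers, local, shared and constant memory?
Can I use the CUDA profiler to get easier and more accurate results? Which counters would I need to use? How would I need to interpret them?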
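
As a sanity check, the estimate above can be reproduced with a few lines of host-side C++ (which also compiles as CUDA host code). This is only a sketch of the arithmetic: the function and parameter names are made up for illustration, and it assumes that every counted access really reaches global memory, which is exactly the assumption in doubt.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved if every counted access really went to global memory.
// All names here are illustrative; none come from the original kernel.
inline double naive_traffic_bytes(std::uint64_t threads,
                                  std::uint64_t accesses_per_thread,
                                  std::uint64_t bytes_per_access) {
    return static_cast<double>(threads) *
           static_cast<double>(accesses_per_thread) *
           static_cast<double>(bytes_per_access);
}

// Bandwidth in GB/s (1 GB = 1e9 bytes) for a runtime given in seconds.
inline double bandwidth_gb_per_s(double bytes, double seconds) {
    assert(seconds > 0.0);  // a zero or negative runtime is meaningless
    return bytes / seconds / 1e9;
}
```

With 1,000,000 threads, 4000*2+2 accesses per thread, 4 bytes per access and a runtime of 0.1 s, `naive_traffic_bytes` gives 32.008 GB and `bandwidth_gb_per_s` gives roughly 320 GB/s, well above the 102.4 GB/s peak.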

Profiler output:

method              gputime    cputime  occupancy instruction warp_serial memtransfer
memcpyHtoD           10.944         17                                          16384
fill                  64.32         93          1       14556           0
fill                 64.224         83          1       14556           0
memcpyHtoD           10.656         11                                          16384
fill                 64.064         82          1       14556           0
memcpyHtoD          1172.96       1309                                        4194304
memcpyHtoD           10.688         12                                          16384
cu_more_regT      93223.906      93241          1    40716656           0
memcpyDtoH         1276.672       1974                                        4194304
memcpyDtoH         1291.072       2019                                        4194304
memcpyDtoH          1278.72       2003                                        4194304
memcpyDtoH             1840       3172                                        4194304

New question: when 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB and gpu_time ~= 0.1 s, I achieve a bandwidth of 10*40 MB/s = 400 MB/s. That seems very low. Where is the error?
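
The per-row arithmetic behind this can be sketched as a small host-side C++ helper. It assumes the profiler's `gputime` column is in microseconds and `memtransfer` is in bytes; the function name is hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one profiler row.
// Assumption: gputime_us is the row's gputime value, in microseconds.
inline double row_bandwidth_gb_per_s(std::uint64_t bytes, double gputime_us) {
    assert(gputime_us > 0.0);  // guard against division by zero
    const double seconds = gputime_us * 1e-6;
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

Under those assumptions, dividing 4194304 bytes by the kernel's gputime of 93223.906 gives about 0.045 GB/s, in line with the ~40 MB/s figure for a single 4 MB array. The same helper applied to the 4194304-byte memcpyHtoD row (gputime 1172.96) gives about 3.58 GB/s, but that is a host-to-device copy rate over the bus, not device memory bandwidth.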

P.S. Tell me if you need other counters for your answer.

Sister question: How to calculate Gflops of a kernel


Recommended answer