cuda - Is it possible to see via nvprof (or some other method) whether kernel execution happened on Tensor Cores or not?

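I am trying to identify the GPU execution performance bottlenecks of deep learning models on a Titan V / V100.
Based on https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/, I understand that certain requirements must be met for the underlying kernels to execute on Tensor Cores.

"nvprof" provides an easy way to dump all kernel executions on the GPU, but it does not seem to say whether Tensor Cores were actually used.
Is there a way to capture this kind of information?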

Best answer

According to these slides from NVIDIA, titled "Training Neural Networks with Mixed Precision", you can use nvprof to see whether Tensor Cores are being used.

Page 12 of the slides essentially says to run the program under nvprof and look for kernels with "884" in their names.

For example:

$ nvprof python test.py
...
37.024us 1 37.024us 37.024us 37.024us volta_fp16_s884gemm_fp16…
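
For a long profile, scanning for "884" by eye gets tedious. Below is a minimal sketch of a filter that applies the same rule of thumb to a saved nvprof log. The script name `find_884.py`, the function `find_884_kernels`, and the regular expression are illustrative assumptions, not part of nvprof; the only rule it encodes is the "look for 884 in the kernel name" heuristic from the slides.

```python
import re
import sys

# Identifier-like token that contains "884" (e.g. volta_fp16_s884gemm_fp16).
# Requiring a leading letter or underscore keeps durations such as "1.884us"
# from being mistaken for kernel names.
KERNEL_884 = re.compile(r"\b[A-Za-z_]\w*884\w*")


def find_884_kernels(lines):
    """Return the distinct kernel names containing "884", in first-seen order."""
    found = []
    for line in lines:
        for name in KERNEL_884.findall(line):
            if name not in found:
                found.append(name)
    return found


# Hypothetical usage, assuming the profile was saved with nvprof's --log-file:
#   nvprof --log-file nvprof.log python test.py
#   python find_884.py nvprof.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as log:
        matches = find_884_kernels(log)
    if matches:
        print("Kernels with '884' in the name:")
        for name in matches:
            print("  " + name)
    else:
        print("No '884' kernels found.")
```

An empty result only means that no kernel name matched the pattern; it is a naming heuristic, so treat it as a hint rather than proof either way.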

This question originally appeared on Stack Overflow: https://stackoverflow.com/questions/47913943/