How to adjust the number of CUDA blocks and threads for optimal performance

Problem description

I've tested several values of blocks and threads empirically, and the execution time can be greatly reduced with specific values.

I don't see what the differences between blocks and threads are. I figure that threads in a block may have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated over blocks/threads.

My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could that be possible? Thank you.
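
(For orientation, here is a minimal sketch of how a 1-D launch typically splits work across blocks and threads; the kernel name, sizes and variables are illustrative, not from the question.)

// Minimal sketch: each thread handles one element.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail
        data[i] *= factor;
}

// Host side: pick threads per block, derive the block count from n.
int threads = 256;                          // tunable
int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n elements
scale<<<blocks, threads>>>(d_data, n, 2.0f);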

Answer

Hong Zhou's answer is good, so far. Here are some more details:

When using shared memory you might want to consider it first, because it's a very limited resource, and it's not unlikely for kernels to have very specific needs that constrain those many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
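
As an illustration of that trade-off, a sketch using dynamic shared memory, where the per-block shared region grows with the block size chosen at launch (kernel and variable names are hypothetical):

// Each block stages blockDim.x floats in shared memory.
__global__ void blockStage(const float *in, float *out)
{
    extern __shared__ float tile[];                 // sized at launch time
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    // ... cooperative work on tile[] within the block ...
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}

// Larger blocks share a larger region, smaller blocks a smaller one:
int threads = 128;  // tunable
blockStage<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);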

If your code can live with as little as 16KB of shared memory per multiprocessor, you might want to opt for the larger (48KB) L1 cache by calling

cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
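
(The call above sets a device-wide preference; if only one kernel benefits, the runtime also offers a per-kernel variant, shown here with a hypothetical kernel name:)

cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);  // per-kernel preference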

Further, L1 caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg, to avoid pollution when the kernel accesses global memory carefully.
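
For reference, the option is simply forwarded by nvcc to ptxas; a typical compile line (file and binary names hypothetical) would be:

nvcc -O3 -Xptxas=-dlcm=cg -o app kernel.cu   # -dlcm=cg: cache global loads in L2 only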

Concerning occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in the linked thread). The higher the occupancy (warps per multiprocessor), the less likely the multiprocessor will have to wait (for memory transactions or data dependencies), but the more threads must share the same L1 caches, shared memory area and register file (see the CUDA Optimization Guide and also this presentation).
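
One compile-time knob for this trade-off is the __launch_bounds__ qualifier, which tells the compiler the maximum block size to budget registers for and, optionally, a minimum number of resident blocks per multiprocessor; the values below are purely illustrative:

// Budget registers for at most 256 threads per block and aim for
// at least 2 resident blocks per multiprocessor.
__global__ void __launch_bounds__(256, 2) myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}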

The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs: register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
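
Register usage and spill traffic can be checked at build time by making ptxas verbose, and a per-thread register cap can be imposed (the 32-register limit below is just an example):

nvcc -Xptxas=-v kernel.cu          # prints e.g. "Used 34 registers, 0 bytes spill stores, 0 bytes spill loads"
nvcc --maxrregcount=32 kernel.cu   # cap registers per thread; may trade spilling for occupancy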

Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
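
With a later toolkit's command-line profiler (nvprof, CUDA 5.0 and newer, so past the version discussed here), that can look like the following; the binary name is hypothetical:

nvprof --metrics achieved_occupancy,ipc ./app   # occupancy and instruction throughput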

It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial, and will require recompilation, or different variants of the kernel to be deployed, for every target device architecture.
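
On later toolkits (CUDA 6.5 and newer, so beyond the version discussed here), the runtime can at least suggest an occupancy-maximizing block size at run time; a minimal sketch, assuming such a toolkit and a hypothetical kernel:

int minGridSize = 0, blockSize = 0;
// Ask the runtime for a block size that maximizes occupancy for myKernel.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
int gridSize = (n + blockSize - 1) / blockSize;  // cover all n elements
myKernel<<<gridSize, blockSize>>>(d_data, n);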

