In what types of loops is it best to use the #pragma unroll directive in CUDA?

Question

In CUDA it is possible to unroll loops using the #pragma unroll directive to improve performance by increasing instruction-level parallelism. The #pragma can optionally be followed by a number that specifies how many times the loop must be unrolled.
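For reference, here is a minimal sketch of the three forms the directive can take. The kernel `unroll_forms` and the host helper `run_unroll_forms` are hypothetical names made up for illustration, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread sums the same 8 floats three times,
// once per form of the directive.
__global__ void unroll_forms(const float* in, float* out)
{
    float a = 0.0f, b = 0.0f, c = 0.0f;

    #pragma unroll      // no number: unroll fully (the trip count is a constant)
    for (int i = 0; i < 8; ++i) a += in[i];

    #pragma unroll 4    // unroll by a factor of 4
    for (int i = 0; i < 8; ++i) b += in[i];

    #pragma unroll 1    // factor 1: do not unroll
    for (int i = 0; i < 8; ++i) c += in[i];

    out[threadIdx.x] = a + b + c;
}

// Hypothetical host helper: launches one block of 32 threads and returns
// the value written by thread 0. Error checking omitted for brevity.
float run_unroll_forms()
{
    float h_in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float h_out = -1.0f;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    unroll_forms<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The directive only affects the loop that immediately follows it. It changes the generated code, not the result: all three loops compute the same sum.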

Unfortunately, the documentation does not give specific directions on when this directive should be used. Since small loops with a known trip count are already unrolled by the compiler, should #pragma unroll be used on larger loops? On small loops with a variable counter? And what about the optional number of unrolls? Also, is there recommended documentation about CUDA-specific loop unrolling?

Recommended answer

There aren't any hard and fast rules. The CUDA compiler has at least two unrollers, one each inside the NVVM or Open64 frontends, and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute. The reasons for turning off loop unrolling are twofold:

(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling also tends to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.
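The local-array case in point (1) can be sketched as follows. This is an illustrative example, not code from the answer: the kernel `moving_sum`, the constant `TAPS`, and the helper `run_moving_sum` are assumed names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

constexpr int TAPS = 4;  // hypothetical window length

// Hypothetical kernel: out[i] = in[i] + in[i+1] + ... + in[i+TAPS-1].
__global__ void moving_sum(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + TAPS > n) return;

    float window[TAPS];  // small per-thread array

    // After full unrolling, every index k is a compile-time constant, so the
    // compiler is free to keep window[] in registers (more registers used).
    // Replacing the directive with `#pragma unroll 1` keeps k a run-time
    // value instead.
    #pragma unroll
    for (int k = 0; k < TAPS; ++k) window[k] = in[i + k];

    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < TAPS; ++k) sum += window[k];

    out[i] = sum;
}

// Hypothetical host helper: fills in[] with 0..63 and returns out[10].
// Error checking omitted for brevity.
float run_moving_sum()
{
    const int n = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));
    moving_sum<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[10];
}
```

To see the effect on a real kernel, compile with `nvcc -Xptxas -v`. This prints the register count and any spill loads and stores per kernel, so the unrolled and non-unrolled variants can be compared directly.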

(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle loop counts that are not an exact multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains to be had from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always provide the best decision.
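To make the overhead in point (2) concrete, the sketch below writes out by hand roughly what a factor-4 partial unroll implies. It is an illustration only, and the code the compiler actually generates may be structured differently. The names `row_sums` and `run_row_sums` are hypothetical, and error checking is omitted.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per row, summing `len` values, where
// `len` is only known at run time.
__global__ void row_sums(const float* data, float* out, int rows, int len)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    const float* row = data + static_cast<size_t>(r) * len;

    float sum = 0.0f;
    int k = 0;

    // Main body: four iterations per trip, written out by hand.
    for (; k + 3 < len; k += 4)
        sum += row[k] + row[k + 1] + row[k + 2] + row[k + 3];

    // Clean-up: the 0 to 3 iterations left when len is not a multiple of 4.
    for (; k < len; ++k)
        sum += row[k];

    out[r] = sum;
}

// Hypothetical host helper: two rows of six values (1..12), returns the
// sum of the second row. Error checking omitted for brevity.
float run_row_sums()
{
    const int rows = 2, len = 6;
    float h_data[rows * len], h_out[rows] = {0.0f, 0.0f};
    for (int i = 0; i < rows * len; ++i) h_data[i] = static_cast<float>(i + 1);
    float *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    row_sums<<<1, 32>>>(d_data, d_out, rows, len);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return h_out[1];
}
```

With `len = 6`, the main body runs once and the clean-up loop twice. With `len = 2`, only the clean-up loop runs, so the extra loop setup and bounds checks buy nothing.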

In rare cases I have found that manually providing a higher unrolling factor than the one the compiler chose automatically has a small beneficial effect on performance (with typical gains in the single-digit percent range). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight compute-bound loops that benefit from minimization of the loop overhead.
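A memory-bound loop of the kind described here might look like the sketch below. The factor 8 is an arbitrary assumption for illustration, not a recommendation, and any gain has to be confirmed by measurement on the target GPU. The names `scale` and `run_scale` are hypothetical, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel using a grid-stride loop.
__global__ void scale(const float* in, float* out, int n, float s)
{
    int stride = gridDim.x * blockDim.x;

    // Assumed factor: a larger factor puts more independent loads in each
    // loop body, which gives the scheduler more loads to issue early.
    #pragma unroll 8
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = s * in[i];
}

// Hypothetical host helper: scales 0..999 by 2 and returns the last element.
// Error checking omitted for brevity.
float run_scale()
{
    const int n = 1000;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<2, 64>>>(d_in, d_out, n, 2.0f);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[n - 1];
}
```

A reasonable way to try this is to time the kernel with the directive removed, then with a few different factors, and keep the explicit factor only if it wins consistently.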

Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.
