


In C, I have a task where I must do multiplication, inversion, transposition, addition, etc. on huge matrices allocated as 2-dimensional arrays (arrays of arrays).

I have found the gcc flag -funroll-all-loops. If I understand correctly, this will unroll all loops automatically without any effort by the programmer.


a) Does gcc include this kind of optimization in the various optimization levels such as -O1, -O2, etc.?


b) Do I have to use any pragmas inside my code to take advantage of loop unrolling or are loops identified automatically?


c) Why is this option not enabled by default if the unrolling increases the performance?

d) What are the recommended gcc optimization flags to compile the program in the best way possible? (The program must be optimized for a single CPU family, the same as the machine where I compile the code; currently I use the -march=native and -O2 flags.)



It seems there is some controversy about unrolling, which in some cases may slow a program down. In my case, several functions do simple math operations inside two nested for loops that iterate over the matrix elements, for a huge number of elements. In this scenario, how could unrolling slow down or speed up the program?



Why unroll loops?

Modern processors pipeline instructions. They like knowing what's coming next and make all sorts of fancy optimisations based on assumptions of which order the instructions should be executed.


At the end of a loop though, there are two possibilities! Either you go back to the top, or continue on. The processor makes an educated guess on which is going to happen. If it gets it right, everything is good. If not, it has to flush the pipeline and stall for a bit while it prepares for taking the other branch.


As you can imagine, unrolling a loop eliminates branches and the potential for those stalls, especially in cases where the odds are against a guess.


Imagine a loop of code that executes 3 times, then continues. If you assume (as the processor probably would) that at the end you'll repeat the loop, then 2/3 of the time you'll be correct! 1/3 of the time though, you'll stall.
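To make the 3-iteration case concrete, here is a minimal sketch of a loop and its fully unrolled equivalent. The function names `sum3_loop` and `sum3_unrolled` are made up for illustration, and an optimizing compiler may well produce the second form from the first on its own.

```c
#include <stddef.h>

/* Rolled version: one compare-and-branch at the end of every iteration. */
double sum3_loop(const double v[3])
{
    double total = 0.0;
    for (size_t i = 0; i < 3; ++i) {
        total += v[i];
    }
    return total;
}

/* Unrolled by hand: the same additions in the same order, with no loop
 * branch left for the processor to guess. */
double sum3_unrolled(const double v[3])
{
    double total = 0.0;
    total += v[0];
    total += v[1];
    total += v[2];
    return total;
}
```

Both functions return the same value; the only difference is that the unrolled one has no backward branch.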


On the other hand, imagine the same situation, but the code loops 3000 times. Here the guess is wrong only once in 3000 iterations, so unrolling has very little misprediction cost left to remove.


Part of the processor fanciness mentioned above involves loading the instructions from the executable in memory into the processor's onboard instruction cache (shortened to I-cache). This holds a limited number of instructions that can be accessed quickly, but the processor may stall when new instructions need to be loaded from memory.

Let's go back to the previous examples. Assume a reasonably small amount of code inside the loop takes up n bytes of I-cache. If we unroll the loop, it now takes up n * 3 bytes. That is a bit more, but it will probably still fit in a single cache line, so the cache keeps working optimally and the processor does not need to stall reading from main memory.

The 3000-loop, however, unrolls to use a whopping n * 3000 bytes of I-cache. That's going to require several reads from memory, and probably push some other useful stuff from elsewhere in the program out of the I-cache.


As you can see, unrolling provides more benefit for shorter loops, but fully unrolling a loop that runs a large number of times ends up hurting performance.
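Between "not at all" and "completely" there is a middle ground: unrolling by a small fixed factor, so the code grows by that factor rather than by the trip count. The sketch below does this by hand with a factor of 4, chosen arbitrarily, and a hypothetical function name `add_unrolled4`; it only illustrates the shape of such code, not what gcc would emit for any particular loop.

```c
#include <stddef.h>

/* Adds b into a element by element, four elements per pass. */
void add_unrolled4(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main body: one loop branch for every four additions. */
    for (; i + 4 <= n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; ++i) {
        a[i] += b[i];
    }
}
```

The loop body is roughly four times larger no matter whether n is 30 or 3000, which keeps the I-cache cost bounded while still removing three out of every four loop branches.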


Usually, a smart compiler will take a decent guess about which loops to unroll but you can force it if you're sure you know better. How do you get to know better? The only way is to try it both ways and compare timings!
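As a rough sketch of "force it, then measure": GCC 8 and later accept `#pragma GCC unroll n` immediately before a loop, so no other source changes are needed to request unrolling for one specific loop. Everything else below is assumed for illustration: the function names, the flat row-major matrix layout, the factor of 4, and the very crude `clock()` based timer.

```c
#include <stddef.h>
#include <time.h>

/* Matrix addition left entirely to the compiler's own heuristics. */
void mat_add_plain(double *dst, const double *a, const double *b,
                   size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            dst[i * cols + j] = a[i * cols + j] + b[i * cols + j];
        }
    }
}

/* Same computation, asking GCC to unroll the inner loop 4 times.
 * Compilers that do not know this pragma may ignore it or warn. */
void mat_add_unrolled(double *dst, const double *a, const double *b,
                      size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i) {
#pragma GCC unroll 4
        for (size_t j = 0; j < cols; ++j) {
            dst[i * cols + j] = a[i * cols + j] + b[i * cols + j];
        }
    }
}

typedef void (*mat_add_fn)(double *, const double *, const double *,
                           size_t, size_t);

/* Returns the CPU time in seconds spent calling fn `repeats` times. */
double seconds_for(mat_add_fn fn, double *dst, const double *a,
                   const double *b, size_t rows, size_t cols, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r) {
        fn(dst, a, b, rows, cols);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `seconds_for` once with each function on matrices of your real size, built with the flags you actually ship with, gives the comparison. Run it several times, since a single measurement can be noisy.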

"Premature optimization is the root of all evil." - Donald Knuth



09-03 06:47