问题描述
瓦格纳雾的优化大会指南,12.7节:一个循环的例子。其中有一段讨论的例子code的:
## Example 12.6b. DAXPY algorithm, 32-bit mode
[...] ; not shown: initialize some regs before the loop
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.
I see that the author explains that it is connected with out-of-order execution and pipelining but I cannot see it. I mean, though, only multiplication causes latency 5 cycles so only this value is greater than 4 cycle.
I also cannot understand why the author doesn't care about the dependency here:
add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed beforecmp
andcmp
must be executed beforejl
.
PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.
the mul isn't part of a loop-carried dependency chain, so there can be
mulpd
insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all, it's the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.The
cmp
/jl
in each iteration depend on theadd
from that iteration, but theadd
in the next iteration doesn't depend on thecmp
. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before thejl
from the preceding iteration retires.By comparison,
cmov
is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.Each loop iteration has a separate
cmp
/jl
dependency chain, just like the FP dependency chain.
I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.
这篇关于依赖链分析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!