Consider the following x86-64 assembly:
inner:
    ...
    ret

outer:
.top:
    call inner
    dec rdi
    jnz .top
    ret
The function outer simply repeatedly makes a call to the function inner (whose body isn't shown - it may be empty).
Does the series of call instructions in outer, and the corresponding ret instructions inside inner, form a dependent chain in practice (for the purposes of estimating performance)?
There is more than one way this chain could be formed. For example, does the ret depend on the latency of the preceding call instruction, and then does the subsequent call instruction depend on the ret, forming a call -> ret -> call chain? Or perhaps the ret is independent but the call is not, forming a call -> call chain? If there is a chain, is it through memory, a register, the stack engine, the return address predictor, or what?
Motivation: This question originated from a series of comments on another question, mostly this comment and earlier ones.
The terminology might be somewhat unclear here: the stack engine is normally understood to handle transforming rsp-modifying instructions into a single access with an appropriate offset, so that push rax; push rbx might be transformed into something like mov [t0], rax; mov [t0 - 8], rbx, where t0 is some temporary register that captured the value of rsp at some point. It is also understood to handle a similar transformation for call and ret instructions, which both modify the stack in a way similar to push and pop, as well as including a direct and indirect (respectively) jump. The CPU also includes a mechanism to predict that return indirect jump, which some lump under "stack engine" - but here I'm separating that out into "return address predictor".
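To make the push/pop analogy concrete, here is a rough sketch of the architectural effect of call and ret written as explicit stack operations (an illustration only - not the actual internal uop sequences, and r11 is just a stand-in for a temporary):

    ; call inner   behaves roughly like:
    lea  r11, [rel .after_call]   ; the return address (address of the next instruction)
    sub  rsp, 8
    mov  [rsp], r11               ; the store that writes the return address
    jmp  inner
.after_call:

    ; ret   behaves roughly like:
    mov  r11, [rsp]               ; the load that reads the return address back
    add  rsp, 8
    jmp  r11                      ; indirect jump, predicted by the return-address predictor

The question is whether that store/reload pair couples successive call/ret instructions into one long latency chain.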
No, branch-prediction + speculative execution break the store/reload dependency.
RIP is (speculatively) known by the front-end, from the return-address predictor. The next call instruction can thus push a return address without waiting for the ret to execute (and actually load and verify the correctness of the predicted return address against the data from the stack). Speculative stores can enter the store buffer and be store-forwarded.
There is of course a dependency chain, but it's not loop-carried. Out-of-order execution hides it by keeping many iterations in flight.
Proof: call's store breaks what would otherwise be a loop-carried memory dependency chain.
align 64
global _start
_start:
    mov ebp, 250000000   ; I had been unrolling by 4, should have changed this to 5000... before measuring, but forgot.
align 32
.mainloop:
    call delay_retaddr
    call delay_retaddr
    dec ebp
    jg .mainloop

    xor edi,edi
    mov eax,231          ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall              ; sys_exit_group(0)

;; Placing this function *before* _start, or increasing the alignment,
;; makes it somewhat slower!
align 32
delay_retaddr:
    add qword [rsp], 0
    add qword [rsp], 0   ; create latency for the ret addr
    ret
Assemble and link with yasm -felf64 -Worphan-labels -gdwarf2 foo.asm && ld -o foo foo.o, producing a static ELF binary.
Profiled (on an i7-6700k) with ocperf.py, I get 0.99 instructions per core clock cycle:
$ taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dsb2mite_switches.penalty_cycles -r2 ./foo
Performance counter stats for './foo' (2 runs):
645.770390 task-clock (msec) # 1.000 CPUs utilized ( +- 0.05% )
1 context-switches # 0.002 K/sec ( +-100.00% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.004 K/sec ( +- 20.00% )
2,517,412,984 cycles # 3.898 GHz ( +- 0.09% )
1,250,159,413 branches # 1935.919 M/sec ( +- 0.00% )
2,500,838,090 instructions # 0.99 insn per cycle ( +- 0.00% )
4,010,093,750 uops_issued_any # 6209.783 M/sec ( +- 0.03% )
7,010,150,784 uops_executed_thread # 10855.485 M/sec ( +- 0.02% )
62,838 dsb2mite_switches_penalty_cycles # 0.097 M/sec ( +- 30.92% )
0.645899414 seconds time elapsed ( +- 0.05% )
With the called function before _start, and alignment values of 128, IPC can go down from 0.99 to 0.84, which is super-weird. Counts for dsb2mite switches are still low-ish, so it's mostly still running from the uop cache, not the legacy decoders. (This Skylake CPU has the microcode update that disables the loop buffer, in case that would be relevant with all this jumping.)
To sustain good throughput, the CPU has to keep many iterations of the inner loop in flight because we've significantly lengthened the independent dep chains that need to overlap.
Changing the add [rsp], 0 instructions to [rsp+16] creates a loop-carried dependency chain on a different location, which isn't being stored to by call. So the loop bottlenecks on that store-forwarding latency and runs at ~half speed.
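For reference, the modified delay function looks like this (a sketch of the change just described; only the addressing mode differs from the measured code above):

align 32
delay_retaddr:
    add qword [rsp+16], 0
    add qword [rsp+16], 0   ; loop-carried: call's store (to [rsp]) never overwrites this location
    ret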
# With add qword [rsp+16], 0
Performance counter stats for './foo' (2 runs):
1212.339007 task-clock (msec) # 1.000 CPUs utilized ( +- 0.04% )
2 context-switches # 0.002 K/sec ( +- 60.00% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.002 K/sec
4,727,361,809 cycles # 3.899 GHz ( +- 0.02% )
1,250,292,058 branches # 1031.306 M/sec ( +- 0.00% )
2,501,537,152 instructions # 0.53 insn per cycle ( +- 0.00% )
4,026,138,227 uops_issued_any # 3320.967 M/sec ( +- 0.02% )
7,026,457,222 uops_executed_thread # 5795.786 M/sec ( +- 0.01% )
230,287 dsb2mite_switches_penalty_cycles # 0.190 M/sec ( +- 68.23% )
1.212612110 seconds time elapsed ( +- 0.04% )
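A rough back-of-envelope check of these numbers (my own arithmetic; the store-forwarding figure is an assumption about Skylake, not something stated or measured above):

    4,727,361,809 cycles / 250,000,000 iterations ≈ 18.9 cycles per iteration
    18.9 cycles / 4 dependent add-to-[rsp+16] ops ≈ 4.7 cycles per load + add + store-forward round trip

which is in the right ballpark for store-forwarding latency plus a 1-cycle add. The fast version works out to 2,517,412,984 / 250,000,000 ≈ 10.1 cycles for the 10 instructions per iteration, i.e. the measured ~0.99 IPC.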
Note that I'm still using an RSP-relative address, so there's still a stack-sync uop. I could have kept both cases the same and avoided it in both by using an address relative to a different register (e.g. rbp) to address the location where call/ret store/reload the return address.
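A sketch of what that could look like (hypothetical - this variant wasn't measured; rbx is used instead of rbp because ebp is already the loop counter in this test):

_start:
    mov ebp, 250000000
    lea rbx, [rsp-8]        ; the slot where call will push the return address
align 32
.mainloop:
    call delay_retaddr
    call delay_retaddr
    dec ebp
    jg .mainloop
    ...                     ; exit sequence as before

align 32
delay_retaddr:
    add qword [rbx], 0      ; same location as [rsp] inside the function,
    add qword [rbx], 0      ; but no RSP reference, so no stack-sync uop
    ret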
I don't think the variable latency of store-forwarding (worse when the reload happens right away, back-to-back) is sufficient to explain the difference; see also Adding a redundant assignment speeds up code when compiled without optimization. The difference here is a factor-of-2 speedup from breaking the dependency (0.99 IPC vs. 0.53 IPC, with the same instructions and only a different addressing mode).
The instructions are 1 byte longer with the disp8 in the addressing mode, and there was front-end weirdness with alignment in the faster version, but moving things around doesn't seem to change anything with the [rsp+16] version.
Using a version that creates a store-forwarding stall (with add dword [rsp], 0) makes the dep chain too long for OoO exec to hide easily. I didn't play around with this a huge amount.
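For completeness, that variant would look something like this (a sketch; the explanation of the stall is my reading, not spelled out above - ret's qword reload of the return address is wider than the last store to that location, which is now only a dword, so it can't be satisfied by store forwarding and has to wait much longer):

align 32
delay_retaddr:
    add dword [rsp], 0      ; dword RMW of the low half of the return address
    add dword [rsp], 0
    ret                     ; qword reload is wider than the preceding dword store
                            ; -> store-forwarding stall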