Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?
ADC on Haswell and earlier is normally 2 uops, with 2 cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and micro-fusion of indexed addressing modes in some cases.
(But BDW/SKL still uses 2 uops for the adc al, imm8 short-form encoding, or the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.)
But adc with immediate 0 is special-cased on Haswell to decode as only a single uop. @BeeOnRope tested this, and included a check for this performance quirk in his uarch-bench: https://github.com/travisdowns/uarch-bench. Sample output from CI on a Haswell server shows the difference between adc reg,0 and adc reg,1 or adc reg,zeroed-reg.
(But only for 32 or 64-bit operand-size, not adc bl,0. So use 32-bit when using adc on a setcc result to combine 2 conditions into one branch.)
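For example, a minimal sketch of that setcc + adc idiom (the register choices, label, and unsigned compares are illustrative, not from the question):

; branch once on (a < b) || (c < d), unsigned, with a..d in edi/esi/edx/ecx
xor   eax, eax       ; eax = 0; also clears CF
cmp   edi, esi       ; CF = (a < b)
setb  al             ; al = first condition (0 or 1)
cmp   edx, ecx       ; CF = (c < d)
adc   eax, 0         ; eax = cond1 + cond2; 32-bit operand size keeps it 1 uop on HSW
jnz   .either_true   ; ZF from the adc: nonzero if either condition held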
Same for SBB. As far as I've seen, there's never any difference between ADC and SBB performance on any CPU, for the equivalent encoding with the same immediate value.
When was this optimization for imm=0 introduced?
I tested on Core 2, and found that adc eax,0 latency is 2 cycles, same as adc eax,3. And also the cycle count is identical for a few variations of throughput tests with 0 vs. 3, so first-gen Core 2 (Conroe/Merom) doesn't do this optimization.
The easiest way to answer this is probably to use my test program below on a Sandybridge system, and see if adc eax,0 is faster than adc eax,1. But answers based on reliable documentation would be fine, too.
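If you want to build both variants from a single source file, one low-effort approach is a command-line define (the IMM macro here is a hypothetical addition; the program below hard-codes 0):

; at the top of testloop.asm:
%ifndef IMM
%define IMM 0        ; override with: yasm -felf64 -DIMM=1 testloop.asm
%endif
; ... then write the dependency chain as:  adc eax, IMM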
Footnote 1: I used this test program on my Core 2 E6600 (Conroe / Merom), running Linux.
;; NASM / YASM
;; assemble / link this into a 32 or 64-bit static executable.
global _start
_start:
    mov     ebp, 100000000

align 32
.loop:
    xor     ebx, ebx        ; avoid partial-flag stall but don't break the eax dependency
%rep 5
    adc     eax, 0          ; should decode in a 2+1+1+1 pattern
    add     eax, 0
    add     eax, 0
    add     eax, 0
%endrep
    dec     ebp             ; I could have just used SUB here to avoid a partial-flag stall
    jg      .loop

%ifidn __OUTPUT_FORMAT__, elf32
    ;; 32-bit sys_exit would work in 64-bit executables on most systems, but not all.
    ;; Some, notably Windows Subsystem for Linux, disable IA32 compat.
    mov     eax, 1
    xor     ebx, ebx
    int     0x80            ; sys_exit(0), 32-bit ABI
%else
    xor     edi, edi
    mov     eax, 231        ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall                 ; sys_exit_group(0)
%endif
Linux perf doesn't work very well on old CPUs like Core 2 (it doesn't know how to access all the events like uops), but it does know how to read the HW counters for cycles and instructions. That's sufficient.
I built and profiled this with:
yasm -felf64 -gdwarf2 testloop.asm
ld -o testloop-adc+3xadd-eax,imm=0 testloop.o
# optional: taskset pins it to core 1 to avoid CPU migrations
taskset -c 1 perf stat -e task-clock,context-switches,cycles,instructions ./testloop-adc+3xadd-eax,imm=0
Performance counter stats for './testloop-adc+3xadd-eax,imm=0':
1061.697759 task-clock (msec) # 0.992 CPUs utilized
100 context-switches # 0.094 K/sec
2,545,252,377 cycles # 2.397 GHz
2,301,845,298 instructions # 0.90 insns per cycle
1.069743469 seconds time elapsed
0.9 IPC is the interesting number here.
This is about what we'd expect from static analysis with a 2 uop / 2c latency adc: (5*(1+3) + 3) = 23 instructions in the loop, 5*(2+3) = 25 cycles of latency = cycles per loop iteration. 23/25 = 0.92.
It's 1.15 on Skylake: (5*(1+3) + 3) / (5*(1+3)) = 1.15, i.e. the extra .15 is from the xor-zero and dec/jg, while the adc/add chain runs at exactly 1 uop per clock, bottlenecked on latency. We'd expect this 1.15 overall IPC on any other uarch with single-cycle latency adc, too, because the front-end isn't a bottleneck. (In-order Atom and P5 Pentium would be slightly lower, but xor and dec can pair with adc or add on P5.)
On SKL, uops_issued.any = instructions = 2.303G, confirming that adc is single uop (which it always is on SKL, regardless of what value the immediate has). By chance, jg is the first instruction in a new cache line, so it doesn't macro-fuse with dec on SKL. With dec rbp or sub ebp,1 instead, uops_issued.any is the expected 2.2G.
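For reference, the SKL counts can be reproduced by adding the uop event to the same perf invocation, something like this (the exact event list is my addition; uops_issued.any is the perf event name on Skylake):

taskset -c 1 perf stat -e task-clock,cycles,instructions,uops_issued.any ./testloop-adc+3xadd-eax,imm=0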
This is extremely repeatable: perf stat -r5 (to run it 5 times and show average + variance), and multiple runs of that, showed the cycle count was repeatable to 1 part in 1000. 1c vs. 2c latency in adc would make a much bigger difference than that.
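That is (same event list as before, just adding -r5):

taskset -c 1 perf stat -r5 -e task-clock,context-switches,cycles,instructions ./testloop-adc+3xadd-eax,imm=0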
Rebuilding the executable with an immediate other than 0 doesn't change the timing at all on Core 2, another strong sign that there's no special case. That's definitely worth testing.
I was initially looking at throughput (with xor eax,eax before each loop iteration, letting OoO exec overlap iterations), but it was hard to rule out front-end effects. I think I finally did avoid a front-end bottleneck by adding single-uop add instructions. The throughput-test version of the inner loop looks like this:
    xor     eax, eax        ; break the eax and CF dependency
%rep 5
    adc     eax, 0          ; should decode in a 2+1+1+1 pattern
    add     ebx, 0
    add     ecx, 0
    add     edx, 0
%endrep
That's why the latency-test version looks kinda weird. But anyway, remember that Core2 doesn't have a decoded-uop cache, and its loop buffer is in the pre-decode stage (after finding instruction boundaries). Only 1 of the 4 decoders can decode multi-uop instructions, so adc being multi-uop bottlenecks on the front-end. I guess I could have just let that happen, with times 5 adc eax, 0, since it's unlikely that some later stage of the pipeline would be able to throw out that uop without executing it.
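That simpler, decode-bound body would have looked something like this (a sketch of the alternative just described, not something that was actually measured):

.loop:
    times 5 adc eax, 0      ; 5 back-to-back 2-uop instructions; only 1 of
                            ; Core 2's 4 decoders handles multi-uop instructions,
                            ; so this bottlenecks on decode on pre-Nehalem CPUs
    dec     ebp
    jg      .loop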
Nehalem's loop buffer recycles decoded uops, and would avoid that decode bottleneck for back-to-back multi-uop instructions.
According to my microbenchmarks, the results of which can be found on uops.info, this optimization was introduced with Sandy Bridge (http://uops.info/html-tp/SNB/ADC_R64_I8-Measurements.html). Westmere does not do this optimization (http://uops.info/html-tp/WSM/ADC_R64_I8-Measurements.html). The data was obtained using a Core i7-2600 and a Core i5-650.
Furthermore, the data on uops.info shows that the optimization is not performed if an 8-bit register is used (Sandy Bridge, Ivy Bridge, Haswell).
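A quick way to see the 8-bit exception yourself is to time two separate dependency chains, following the same pattern as the test program above (my sketch, not from the uops.info data):

; 32-bit: single uop with imm=0 on SnB and later
times 100 adc eax, 0
; 8-bit: not special-cased on SnB/IvB/HSW, still 2 uops
times 100 adc al, 0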