Question
ADC on Haswell and earlier is normally 2 uops, with 2-cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and, in some cases, for micro-fusion of indexed addressing modes.
(But BDW/SKL still use 2 uops for the adc al, imm8 short-form encoding, and for the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.)
But adc with immediate 0 is special-cased on Haswell to decode as only a single uop. @BeeOnRope tested this, and included a check for this performance quirk in his uarch-bench: https://github.com/travisdowns/uarch-bench. Sample output from CI on a Haswell server shows a difference between adc reg,0 and adc reg,1 or adc reg,zeroed-reg.
(But only for 32 or 64-bit operand-size, not adc bl,0. So use 32-bit operand-size when using adc on a setcc result to combine 2 conditions into one branch.)
Same for SBB. As far as I've seen, there's never any difference between ADC and SBB performance on any CPU, for the equivalent encoding with the same immediate value.
When was this optimization for imm=0 introduced?
I tested on Core 2 (footnote 1), and found that adc eax,0 latency is 2 cycles, the same as adc eax,3. The cycle count is also identical for a few variations of throughput tests with 0 vs. 3, so first-gen Core 2 (Conroe/Merom) doesn't do this optimization.
The easiest way to answer this is probably to run my test program below on a Sandybridge system, and see if adc eax,0 is faster than adc eax,1. But answers based on reliable documentation would be fine, too.
Footnote 1: I used this test program on my Core 2 E6600 (Conroe / Merom), running Linux.
;; NASM / YASM
;; assemble / link this into a 32 or 64-bit static executable.

global _start
_start:
    mov     ebp, 100000000

align 32
.loop:
    xor     ebx,ebx     ; avoid partial-flag stall but don't break the eax dependency
%rep 5
    adc     eax, 0      ; should decode in a 2+1+1+1 pattern
    add     eax, 0
    add     eax, 0
    add     eax, 0
%endrep

    dec     ebp         ; I could have just used SUB here to avoid a partial-flag stall
    jg      .loop

%ifidn __OUTPUT_FORMAT__, elf32
    ;; 32-bit sys_exit would work in 64-bit executables on most systems, but not all.
    ;; Some, notably Windows Subsystem for Linux, disable IA32 compat
    mov     eax,1
    xor     ebx,ebx
    int     0x80        ; sys_exit(0) 32-bit ABI
%else
    xor     edi,edi
    mov     eax,231     ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall             ; sys_exit_group(0)
%endif
Linux perf doesn't work very well on old CPUs like Core 2 (it doesn't know how to access all the events like uops), but it does know how to read the HW counters for cycles and instructions. That's sufficient.
I built and profiled it with:
yasm -felf64 -gdwarf2 testloop.asm
ld -o testloop-adc+3xadd-eax,imm=0 testloop.o
# optional: taskset pins it to core 1 to avoid CPU migrations
taskset -c 1 perf stat -e task-clock,context-switches,cycles,instructions ./testloop-adc+3xadd-eax,imm=0
Performance counter stats for './testloop-adc+3xadd-eax,imm=0':
1061.697759 task-clock (msec) # 0.992 CPUs utilized
100 context-switches # 0.094 K/sec
2,545,252,377 cycles # 2.397 GHz
2,301,845,298 instructions # 0.90 insns per cycle
1.069743469 seconds time elapsed
0.9 IPC is the interesting number here.
This is about what we'd expect from static analysis with a 2 uop / 2c latency adc: (5*(1+3) + 3) = 23 instructions in the loop, 5*(2+3) = 25 cycles of latency = cycles per loop iteration. 23/25 = 0.92.
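As a quick cross-check of that static analysis, the measured counters can be divided by the iteration count. This is only a sketch: the counter values are copied from the perf stat output above, and the variable names are mine, not part of any tool.

```python
# Counter values copied from the perf stat output for the Core 2 run.
cycles = 2_545_252_377
instructions = 2_301_845_298
iterations = 100_000_000  # loop count loaded into ebp by the test program

cycles_per_iter = cycles / iterations        # measured cost of one loop iteration
insns_per_iter = instructions / iterations   # should be close to the 23 in the loop body
ipc = instructions / cycles

print(f"cycles/iter = {cycles_per_iter:.2f}")
print(f"insns/iter  = {insns_per_iter:.2f}")
print(f"IPC         = {ipc:.3f}")
```

The measured cost comes out slightly above the predicted 25 cycles per iteration, so the measured IPC is a little below the predicted 0.92.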
It's 1.15 on Skylake. (5*(1+3) + 3) / (5*(1+3)) = 1.15, i.e. the extra .15 is from the xor-zero and dec/jg while the adc/add chain runs at exactly 1 uop per clock, bottlenecked on latency. We'd expect this 1.15 overall IPC on any other uarch with single-cycle latency adc, too, because the front-end isn't a bottleneck. (In-order Atom and P5 Pentium would be slightly lower, but xor and dec can pair with adc or add on P5.)
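The same static analysis can be written as a small function of the adc latency, which makes the two predictions easy to compare. This is a minimal sketch assuming the eax dependency chain is the only bottleneck; the function name and parameters are hypothetical, chosen to mirror the loop body of the test program.

```python
def latency_bound_ipc(adc_latency, blocks=5, adds_per_block=3, overhead_insns=3):
    """Predicted IPC when the adc/add chain through eax is the only bottleneck.

    overhead_insns covers xor ebx,ebx plus dec/jg, which are off the eax chain.
    """
    insns = blocks * (1 + adds_per_block) + overhead_insns
    # Each add is assumed to have 1-cycle latency; adc latency is the parameter.
    chain_cycles = blocks * (adc_latency + adds_per_block)
    return insns / chain_cycles

two_cycle_adc = latency_bound_ipc(adc_latency=2)  # the Core 2 case
one_cycle_adc = latency_bound_ipc(adc_latency=1)  # the Skylake case

print(f"2c adc: {two_cycle_adc:.2f} IPC")
print(f"1c adc: {one_cycle_adc:.2f} IPC")
```

The gap between the two predictions is what makes this loop a usable latency test: a 1c vs. 2c adc moves the IPC by roughly a quarter.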
On SKL, uops_issued.any = instructions = 2.303G, confirming that adc is single uop (which it always is on SKL, regardless of what value the immediate has). By chance, jg is the first instruction in a new cache line, so it doesn't macro-fuse with dec on SKL. With dec rbp or sub ebp,1 instead, uops_issued.any is the expected 2.2G.
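The expected uops_issued.any totals follow from counting fused-domain uops per iteration. The sketch below is my own arithmetic, not output from any tool: it assumes every add and the xor count as 1 issued uop each, and that dec/jg count as 1 uop when macro-fused and 2 otherwise.

```python
ITERATIONS = 100_000_000  # loop count from the test program

def uops_issued(adc_uops, dec_jg_fused, blocks=5, adds_per_block=3):
    """Expected issued uops for the whole run of the latency-test loop."""
    per_iter = 1                                      # xor ebx,ebx
    per_iter += blocks * (adc_uops + adds_per_block)  # adc + 3x add, repeated
    per_iter += 1 if dec_jg_fused else 2              # dec/jg, fused or not
    return per_iter * ITERATIONS

unfused = uops_issued(adc_uops=1, dec_jg_fused=False)
fused = uops_issued(adc_uops=1, dec_jg_fused=True)
two_uop_adc = uops_issued(adc_uops=2, dec_jg_fused=False)

print(f"1-uop adc, no macro-fusion: {unfused / 1e9:.1f}G")
print(f"1-uop adc, macro-fused:     {fused / 1e9:.1f}G")
print(f"2-uop adc, no macro-fusion: {two_uop_adc / 1e9:.1f}G")
```

Under these assumptions a 2-uop adc would show up as a clearly higher uop count than the instruction count, which is why uops_issued.any = instructions is good evidence for a single-uop adc.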
This is extremely repeatable: perf stat -r5 (to run it 5 times and show average + variance), and multiple runs of that, showed the cycle count was repeatable to 1 part in 1000. 1c vs. 2c latency in adc would make a much bigger difference than that.
Rebuilding the executable with an immediate other than 0 doesn't change the timing at all on Core 2, another strong sign that there's no special case. That's definitely worth testing.
I was initially looking at throughput (with xor eax,eax before each loop iteration, letting OoO exec overlap iterations), but it was hard to rule out front-end effects. I think I finally did avoid a front-end bottleneck by adding single-uop add instructions. The throughput-test version of the inner loop looks like this:
    xor     eax,eax     ; break the eax and CF dependency
%rep 5
    adc     eax, 0      ; should decode in a 2+1+1+1 pattern
    add     ebx, 0
    add     ecx, 0
    add     edx, 0
%endrep
That's why the latency-test version looks kinda weird. But anyway, remember that Core 2 doesn't have a decoded-uop cache, and its loop buffer is in the pre-decode stage (after finding instruction boundaries). Only 1 of the 4 decoders can decode multi-uop instructions, so back-to-back multi-uop adc instructions would bottleneck on the front-end. I guess I could have just let that happen, with times 5 adc eax, 0, since it's unlikely that some later stage of the pipeline would be able to throw out that uop without executing it.
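To make the decoder constraint concrete, here is a toy model of the 2+1+1+1 pattern. It is a deliberately simplified sketch, not a description of the real hardware: it assumes 4 decoders, that only the first one accepts multi-uop instructions, and it ignores pre-decode limits, macro-fusion and every other front-end effect. The function name is hypothetical.

```python
def decode_cycles(uop_counts, width=4):
    """Cycles to decode an instruction sequence under a toy 4-1-1-1 rule.

    uop_counts holds the uop count of each instruction in program order.
    A multi-uop instruction may only use decoder slot 0 of a cycle.
    """
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        slot = 0
        while i < len(uop_counts) and slot < width:
            if uop_counts[i] > 1 and slot != 0:
                break  # multi-uop instruction waits for slot 0 of the next cycle
            i += 1
            slot += 1
    return cycles

mixed = [2, 1, 1, 1] * 5   # adc followed by three adds, repeated 5 times
back_to_back = [2] * 5     # times 5 adc eax, 0 with a 2-uop adc

print(len(mixed) / decode_cycles(mixed))                # instructions decoded per cycle
print(len(back_to_back) / decode_cycles(back_to_back))
```

In this model the adc + 3x add pattern keeps all four decoders busy every cycle, while back-to-back 2-uop adc instructions decode only one per cycle.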
Nehalem's loop buffer recycles decoded uops, and would avoid that decode bottleneck for back-to-back multi-uop instructions.
Recommended answer
According to my microbenchmarks, the results of which can be found on uops.info, this optimization was introduced with Sandy Bridge (https://www.uops.info/html-tp/SNB/ADC_R64_0-Measurements.html). Westmere does not do this optimization (https://uops.info/html-tp/WSM/ADC_R64_0-Measurements.html). The data was obtained using a Core i7-2600 (Sandy Bridge) and a Core i5-650 (Westmere).
Furthermore, the data on uops.info shows that the optimization is not performed if an 8-bit register is used (Sandy Bridge, Ivy Bridge, Haswell).