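Problem description
The following instructions all do the same thing: set %eax to zero. Which way is optimal (requires the fewest machine cycles)?
    xorl %eax, %eax
    mov  $0, %eax
    andl $0, %eax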
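TL;DR summary: xor same,same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel. In 64-bit mode, still use xor r32, r32, because writing a 32-bit register zeroes the upper 32 bits. xor r64, r64 is a waste of a byte, because it needs a REX prefix.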
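The zero-extension behaviour can be checked from C with GNU-style inline asm. This is a minimal sketch, not part of the original answer: the function names are made up for illustration, and it assumes GCC or Clang targeting x86-64 (other targets take a plain C fallback).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: zero a 64-bit value with the 32-bit form of xor.
 * The %k0 operand modifier (GNU C) prints the 32-bit name of the register
 * that holds v, e.g. %eax when v lives in %rax. */
uint64_t zero_via_xor32(uint64_t v)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ ("xorl %k0, %k0"   /* 32-bit write: upper 32 bits are zeroed too */
             : "+r"(v)
             :
             : "cc");          /* xor modifies the flags */
#else
    v = 0;                     /* portable fallback on other targets */
#endif
    return v;
}

/* Usage: every bit is set on entry, yet the full 64-bit result is zero. */
void check_zero_via_xor32(void)
{
    assert(zero_via_xor32(UINT64_MAX) == 0);
}
```

Compilers already emit the 32-bit xor on their own when zeroing a 64-bit register, so inline asm like this is only useful as an experiment.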
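Zeroing a vector register is usually best done with pxor xmm, xmm. If you're about to use it with FP instructions, xorps xmm, xmm makes sense (but it's unlikely that this is the only case where an extra cycle at the start of a new dependency chain makes any difference). It's also a shorter encoding than pxor, by one byte. On Intel Nehalem, xorps needs execution port 5, while pxor can run on any port (0/1/5). On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).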
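Using the AVX version of a 128b vector instruction zeroes the upper part of the register as well, so vpxor xmm, xmm, xmm is a good choice for AVX2, AVX512, or any future extension.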
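To make the one-byte difference concrete, here is a small sketch listing the machine-code bytes for zeroing xmm0. The array and function names are invented for this example, and xmm0 is an arbitrary choice of register.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings for zeroing xmm0 (illustrative names, not from the answer). */
const unsigned char PXOR_XMM0[]  = {0x66, 0x0F, 0xEF, 0xC0}; /* pxor  xmm0, xmm0       */
const unsigned char XORPS_XMM0[] = {0x0F, 0x57, 0xC0};       /* xorps xmm0, xmm0       */
const unsigned char VPXOR_XMM0[] = {0xC5, 0xF9, 0xEF, 0xC0}; /* vpxor xmm0, xmm0, xmm0 */

/* Bytes saved by choosing xorps over pxor for this register. */
size_t xorps_bytes_saved(void)
{
    return sizeof PXOR_XMM0 - sizeof XORPS_XMM0;
}

/* Usage: xorps has no 0x66 prefix, so it is one byte shorter. */
void check_vector_zeroing_sizes(void)
{
    assert(xorps_bytes_saved() == 1);
    assert(sizeof VPXOR_XMM0 == 4);
}
```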
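Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.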
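xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):
- smaller code-size than mov reg,0. (All CPUs)
- avoids partial-register penalties. (Intel P6-family and SnB-family).
- doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
- smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
- doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)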
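Smaller machine-code size (2 bytes instead of 5) is always an advantage: higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.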
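The sizes can be read straight off the encodings. The sketch below lists the machine-code bytes for zeroing eax/rax with each candidate instruction; the array and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings for zeroing eax / rax (illustrative names). */
const unsigned char XOR_EAX_EAX[] = {0x31, 0xC0};                   /* xor eax, eax           */
const unsigned char MOV_EAX_0[]   = {0xB8, 0x00, 0x00, 0x00, 0x00}; /* mov eax, 0 (imm32)     */
const unsigned char AND_EAX_0[]   = {0x83, 0xE0, 0x00};             /* and eax, 0 (imm8)      */
const unsigned char XOR_RAX_RAX[] = {0x48, 0x31, 0xC0};             /* xor rax, rax (REX.W)   */

/* Bytes saved by xor eax,eax compared with mov eax,0. */
size_t xor_bytes_saved_vs_mov(void)
{
    return sizeof MOV_EAX_0 - sizeof XOR_EAX_EAX;
}

/* Usage: 2 bytes for xor, 5 for mov, and one wasted REX byte for xor r64. */
void check_integer_zeroing_sizes(void)
{
    assert(sizeof XOR_EAX_EAX == 2);
    assert(sizeof MOV_EAX_0 == 5);
    assert(sizeof XOR_RAX_RAX == sizeof XOR_EAX_EAX + 1);
}
```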
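The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't happen in practice), HSW could still sustain 4 uops per clock even when they all need execution ports.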
See another question for some more details.
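The blog post that Michael Petch linked (https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/) points out that Intel's recommended zeroing idiom is xor. The author of that post doesn't seem to realize that even though xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4.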
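On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage of xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.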
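Recognized zeroing idioms avoid partial-register penalties on Intel CPUs that rename partial registers separately from full registers (P6 & SnB families).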
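xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8 bits (AH) are modified and then the whole register is read, and Haswell even removes that.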
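The xor / inc al / inc eax sequence can be tried from C with GNU inline asm. This is only a sketch with an invented function name: it shows the instruction pattern and its architectural result, not the timing, and it assumes GCC or Clang on x86 (other targets use a plain C fallback).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical demo: zero a register with xor, write its low byte,
 * then read and write the full register. */
uint32_t xor_then_partial_write(void)
{
    uint32_t r;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __asm__ ("xorl %0, %0\n\t"  /* full register zeroed, upper bits known zero */
             "incb %b0\n\t"     /* partial write: low byte only (like inc al)  */
             "incl %0"          /* full-width read and write (like inc eax)    */
             : "=q"(r)          /* "q": a register that has a byte form        */
             :
             : "cc");
#else
    r = 2;                      /* portable fallback: same architectural result */
#endif
    return r;
}

/* Usage: two increments of a zeroed register give 2. */
void check_xor_then_partial_write(void)
{
    assert(xor_then_partial_write() == 2);
}
```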
From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.
xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8-bit destination, you usually need to take care to avoid partial-register penalties.
It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the 3-bit reg field of the ModRM byte (the way some other single-operand instructions use it as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.
Ideally, you should use xor / set flags / setcc / read full register:
    ...
    call  some_func
    xor   ecx, ecx    ; zero *before* the test
    test  eax, eax
    setnz cl          ; cl = (some_func() != 0)
    add   ebx, ecx    ; no partial-register penalty here
This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).
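The same xor / test / setcc / full-register-read pattern can be wrapped for use from C. This is a sketch under assumptions: the function names are invented, and it needs GCC or Clang on x86 (a plain C fallback covers other targets). In normal C code you would just write x != 0 and let the compiler pick the sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if x != 0, else 0, zeroing the result
 * register *before* the flag-setting instruction. */
uint32_t is_nonzero_xor_first(uint32_t x)
{
    uint32_t r;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __asm__ ("xorl %0, %0\n\t"   /* zero the full result register first        */
             "testl %1, %1\n\t"  /* set flags from x                           */
             "setnz %b0"         /* write only the low byte of the result      */
             : "=&q"(r)          /* early-clobber: written before x is read    */
             : "r"(x)
             : "cc");
#else
    r = (x != 0);                /* portable fallback */
#endif
    return r;
}

/* Full-width use of the result, like the add ebx, ecx in the listing. */
size_t count_nonzero(const uint32_t *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += is_nonzero_xor_first(a[i]);
    return count;
}
```

The early-clobber marker matters here: without it the compiler could place r and x in the same register, and the xor would destroy x before the test.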
Things are more complicated when you don't want to xor before a flag-setting instruction, e.g. when you want to branch on one condition and then setcc on another condition from the same flags (cmp/jle, then sete), and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.
There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.
Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf.) IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).
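For comparison, a sketch of the setcc / movzx alternative, where nothing is zeroed ahead of the flag-setting instruction. The function name is invented, and the block assumes GCC or Clang on x86 with a plain C fallback elsewhere. It demonstrates the instruction pattern only; it says nothing about which variant is faster on a given CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if x != 0, else 0, widening the setcc
 * result afterwards instead of zeroing the register beforehand. */
uint32_t is_nonzero_movzx(uint32_t x)
{
    uint32_t r;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __asm__ ("testl %1, %1\n\t"   /* set flags from x                          */
             "setnz %b0\n\t"      /* write the low byte; upper bits are stale  */
             "movzbl %b0, %0"     /* zero-extend the byte to the full register */
             : "=q"(r)            /* x is fully consumed before r is written   */
             : "r"(x)
             : "cc");
#else
    r = (x != 0);                 /* portable fallback */
#endif
    return r;
}

/* Usage: same results as the xor-first version. */
void check_is_nonzero_movzx(void)
{
    assert(is_nonzero_movzx(0) == 0);
    assert(is_nonzero_movzx(123) == 1);
}
```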
Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0 / setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)
and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor, and many disadvantages.
See http://agner.org/optimize/ for microarch documentation, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
Interestingly, the oldest P6 design (PPro) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both. (See Agner Fog's Example 6.17. in his microarch pdf. He claims this also applies to P2, P3, and even (early?) PM, but I'm sceptical of that. A comment on the linked blog post says it was only PPro that had this oversight. It seems really unlikely that multiple generations of the P6 family existed without recognizing xor-zeroing as a dep breaker.)
If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, though.