What is the best way to set a register to zero in x86 assembly: xor, mov or and?

Question

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring the fewest machine cycles)?

  xorl   %eax, %eax
  mov    $0, %eax
  andl   $0, %eax


Answer

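TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel. In 64-bit mode, still use xor r32, r32, because writing a 32-bit register zeros the upper 32 bits. xor r64, r64 is a waste of a byte, because it needs a REX prefix.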
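As a rough illustration of the 64-bit point, this is how the forms typically assemble. Intel/NASM syntax is assumed here, and the register choices are arbitrary. The bytes shown are standard encodings, though an assembler may emit the equivalent 33 /r opcode form instead of 31 /r.

```asm
; Zeroing in 64-bit mode. A write to a 32-bit register zero-extends into
; the full 64-bit register, so the first two lines both leave rax = 0.
xor     eax, eax        ; 31 C0       2 bytes, no REX prefix
xor     rax, rax        ; 48 31 C0    3 bytes, the REX.W prefix buys nothing
xor     r8d, r8d        ; 45 31 C0    r8-r15 need a REX prefix either way
```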

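Zeroing a vector register is usually best done with pxor xmm, xmm. If you're about to use it with FP instructions, xorps xmm, xmm makes sense (but this only matters in the unlikely case where an extra cycle at the start of a new dependency chain makes any difference). It is also a one-byte-shorter encoding than pxor. On Intel Nehalem, xorps requires execution port 5, while pxor can run on any port (0/1/5). On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and on pre-Nehalem Intel (P6 / Core2), xorps and pxor are handled the same way (as vector-integer instructions).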

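Using the AVX version of a 128b vector instruction zeros the upper part of the register as well, so vpxor xmm, xmm, xmm is a good choice for AVX2, AVX512, or any future extensions.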
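For reference, a sketch of the three vector-zeroing forms mentioned above, with their usual encodings. Intel/NASM syntax is assumed, and xmm0 is an arbitrary choice of register.

```asm
pxor    xmm0, xmm0          ; 66 0F EF C0   4 bytes (SSE2)
xorps   xmm0, xmm0          ; 0F 57 C0      3 bytes (SSE), one byte shorter than pxor
vpxor   xmm0, xmm0, xmm0    ; C5 F9 EF C0   4 bytes (AVX), also zeros the upper bits of ymm0/zmm0
```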


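Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.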

xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):


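  • Smaller code size than mov reg, 0. (All CPUs)

  • Avoids partial-register penalties for later code. (Intel P6-family and SnB-family.)

  • Doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)

  • Smaller uop (no immediate data) leaves room in the uop cache line for nearby instructions to borrow if needed. (Intel SnB-family.)

  • Doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, and possibly AMD as well, since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)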


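Smaller machine-code size (2 bytes instead of 5) is always an advantage: higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.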
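The size difference can be seen by putting the candidate instructions side by side. This is a sketch in Intel/NASM syntax with the encodings a typical assembler would emit; sub is included only because it is another candidate idiom.

```asm
mov     eax, 0          ; B8 00 00 00 00   5 bytes (opcode + imm32)
xor     eax, eax        ; 31 C0            2 bytes
and     eax, 0          ; 83 E0 00         3 bytes (sign-extended imm8 form)
sub     eax, eax        ; 29 C0            2 bytes
```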


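The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but it saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't happen in practice), HSW could still sustain 4 uops per clock even when they all need execution ports.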

See another question for some more details.

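The blog post that Michael Petch linked (https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/) points out that xor is Intel's recommended zeroing idiom. The author of that post doesn't seem to realize that even though xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), it is still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the limit of 4 zeros per clock comes from. Increased complexity of the register-renaming hardware is only one of the reasons for limiting the width of the design to 4.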

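On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg, reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage of xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.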


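Recognized zeroing idioms avoid partial-register penalties on Intel CPUs that rename partial registers separately from full registers (P6 & SnB families).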

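xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8 bits (AH) are modified and then the whole register is read, and Haswell even removes that.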
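A sketch of that sequence, next to the mov-based variant it is being contrasted with. Intel/NASM syntax is assumed, and the comments only restate the behaviour described above; they are not measurements.

```asm
; Variant A: xor-zeroing first
xor     eax, eax        ; zeroing idiom: upper bytes of eax are tagged as zero
inc     al              ; partial write to the low byte
inc     eax             ; full-register read, no partial-register penalty

; Variant B: mov-zeroing first
mov     eax, 0          ; same value, but not a recognized zeroing idiom
inc     al              ; partial write to the low byte
inc     eax             ; full read after a partial write: the usual pre-IvB penalty applies
```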

Agner Fog's microarch guide covers this on pg 98 (Pentium M section, referenced by later sections including SnB).

pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.


xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.

It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.

Ideally, you should use xor / set flags / setcc / read full register:

...
call  some_func
xor     ecx,ecx    ; zero *before* the test
test    eax,eax
setnz   cl         ; cl = (some_func() != 0)
add     ebx, ecx   ; no partial-register penalty here

This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).

Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
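The two fallbacks discussed above might look roughly like this. It is a minimal sketch in Intel/NASM syntax: it assumes the flags are still live from an earlier cmp or test that a branch has already consumed, and the register choices are arbitrary.

```asm
; Fallback 1: setcc, then zero-extend (suggested above for Intel P6 / SnB families)
sete    cl              ; cl = ZF; writes only the low byte of ecx
movzx   eax, cl         ; zero-extend into a full register after the fact

; Fallback 2: zero with mov, then setcc (suggested above for AMD / P4 / Silvermont)
mov     eax, 0          ; mov leaves the flags alone and breaks the dependency on old eax
sete    al              ; eax now holds 0 or 1
```

Neither sequence modifies the flags before the setcc, which is the whole reason to consider them over xor-zeroing.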

Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)


and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor, and many disadvantages.

See http://agner.org/optimize/ for microarch documentation, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
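To make the dependency-breaking distinction concrete, here is a sketch in Intel/NASM syntax. The imul is only a stand-in for whatever long dependency chain last wrote eax, and the three zeroing lines are alternatives, not a sequence.

```asm
imul    eax, ecx        ; stand-in for a long chain: eax is not ready for several cycles

xor     eax, eax        ; recognized idiom: does not wait for the imul result
and     eax, 0          ; same value, but still has to wait for the old eax
mov     eax, 0          ; write-only destination, so no dependency on the old eax either
```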

Interestingly, the oldest P6 design (PPro) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both. (See Agner Fog's Example 6.17. in his microarch pdf. He claims this also applies to P2, P3, and even (early?) PM, but I'm sceptical of that. A comment on the linked blog post says it was only PPro that had this oversight. It seems really unlikely that multiple generations of the P6 family existed without recognizing xor-zeroing as a dep breaker.)


If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, though.


09-17 15:51