May there be any penalties when using 64/32-bit registers in long mode?

Question

Probably this is not even about micro-optimizations but nano-optimizations, but the subject interests me, and I would like to know whether there are any penalties when using non-native register sizes in long mode.

I've learned from various sources that partial register updates (like writing ax instead of eax) can cause eflags stalls and degrade performance. But I'm not sure about long mode. What register size is considered native for this processor operating mode? x86-64 is still an extension of the x86 architecture, so I believe 32 bits is still native. Or am I wrong?

For example, instructions like

sub eax, r14d

sub rax, r14

have the same encoded size (3 bytes each), but may there be any penalties when using either of them? And may there be any penalties when mixing register sizes in consecutive instructions like the ones below (assuming the high dword is zero in all cases)?

sub ecx, eax
sub r14, rax

Recommended answer

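No, writing to a 32-bit register always zero-extends to the full 64-bit register, so x86-64 avoids any partial-register penalties for 32-bit and 64-bit instructions. Only 8-bit and 16-bit writes leave the upper bits of the register unmodified, so those operand sizes still have the partial-register issues they always had.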

Yes, the default operand size for most instructions is 32 bits.

One common case is materializing a boolean with setcc, which writes only AL. Zeroing EAX with xor eax,eax ahead of the flag-setting instruction and the setcc avoids the partial-register penalty entirely. (Core2/Nehalem stall for fewer cycles than earlier CPUs, but still stall for 2 or 3 cycles while inserting a merging uop. Sandybridge doesn't stall at all while inserting the merging uop.)

(For another summary of partial-register penalties on different CPUs that says basically the same thing, see the Q&A "Why doesn't GCC use partial registers?")

AMD doesn't suffer from partial-register stalls when reading the full register later. Instead, partial-register writes and reads have a false dependency on the full register. (AMD CPUs don't rename sub-registers separately in the first place. Intel P4 and Silvermont / Knights Landing work the same way.)

Intel Haswell/Skylake (and maybe Ivybridge) don't rename al separately from rax at all, so they never need to merge low8 / low16 registers. The flip side is that setcc al has a false dependency on the old value of rax. They do still rename ah separately and merge it. (See the Q&A on HSW/SKL partial-register performance for details.)

Partial-flag stalls are a related problem, for example in this loop:

.intel_syntax noprefix
# partial flag stall when reading a flag that didn't come from
# the last instruction to write any flags.
clc
# edi and esi = one-past-the-end of dst and src
# ecx = -count
bigInt_add:
    mov   eax, [esi+ecx*4]
    adc   [edi+ecx*4], eax   # reads CF, partial flag stall on 2nd and later iterations
    inc   ecx                # writes all flags except CF
    jl    bigInt_add         # loop upwards towards zero

See this Q&A for more discussion about partial-flags issues on Intel pre-Sandybridge vs. Sandybridge.

See also Agner Fog's microarchitecture guide (PDF) and the other links in the x86 tag wiki for more details on all of this.

09-17 15:17