Problem description
In the third chapter of Computer Systems: A Programmer's Perspective, an example program is given when discussing shift operations:
long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}
Its assembly code is as follows (reproducible with GCC 10.2 -O1 for x86-64 on the Godbolt compiler explorer; -O2 schedules the instructions in a different order but still uses movl to ECX):
shift_left4_rightn:
    endbr64
    movq    %rdi, %rax    # get x
    salq    $4, %rax      # x <<= 4
    movl    %esi, %ecx    # get n
    sarq    %cl, %rax     # x >>= n
    ret
I wonder why the instruction that fetches n is movl %esi, %ecx instead of movq %rsi, %rcx, since n is a quad word.
On the other hand, if optimization is the concern, movb %sil, %cl might seem more suitable, since the shift uses only the single-byte register %cl as its count and all higher bits are ignored.
As a result, I cannot figure out the reason for using movl %esi, %ecx when dealing with a long integer.
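The observation that only %cl matters can be sketched in C. The model below is an illustration, not compiler or CPU source: the name sar64_model is hypothetical, and it assumes the documented x86-64 behavior that a 64-bit shift reads its count from CL and masks it to the low 6 bits. The arithmetic shift is written out by hand because right-shifting a negative value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `sarq %cl, %rax`: count_reg stands for the full
 * 64-bit RCX, but only its low byte (CL) is read, and for a 64-bit
 * operand the count is then masked to 6 bits. */
int64_t sar64_model(int64_t x, uint64_t count_reg)
{
    unsigned cl = (unsigned)(count_reg & 0xffu); /* only CL is consulted */
    unsigned masked = cl & 63u;                  /* 6-bit mask for 64-bit shifts */

    /* Portable arithmetic right shift: shift the complement of a negative
     * value (which is non-negative) and complement the result again. */
    if (x < 0)
        return ~(~x >> masked);
    return x >> masked;
}

/* Usage sketch: counts that agree in their low 6 bits give the same result. */
void sar64_model_demo(void)
{
    assert(sar64_model(-256, 4) == -16);
    assert(sar64_model(-256, UINT64_C(0x1234567800000004)) == -16);
    assert(sar64_model(1024, 68) == 64); /* 68 & 63 == 4 */
}
```

Because bits 8 and up of the count register never reach the result, copying 8, 32, or 64 bits of n into RCX is equally correct; the choice among them is purely about encoding size and performance.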
Recommended answer
Yes, GCC realizes that sar ignores the upper bits of the count. Given that, movl is the natural consequence of applying two simple optimization rules:
- Avoid writing partial registers (i.e. 8-bit or 16-bit, where the write merges into the old value instead of zero-extending). See "Why doesn't GCC use partial registers?": the reasons vary across microarchitectures and include, in this case, a false dependency on the old value of RCX.
- Prefer 32-bit operand size, because it is the default in x86-64 machine code and needs no prefixes. It is also at least as fast as any other operand size for any instruction.
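The merge-versus-zero-extend distinction in the first rule can be sketched as a C model of the architectural register state. The function names (movb_model, movw_model, movl_model) are hypothetical and only describe what each write leaves in the 64-bit destination; they say nothing about how a particular microarchitecture implements it.

```c
#include <assert.h>
#include <stdint.h>

/* movb %sil, %cl: replaces bits 0-7, keeps bits 8-63 of the old RCX. */
uint64_t movb_model(uint64_t old_rcx, uint64_t rsi)
{
    return (old_rcx & ~UINT64_C(0xff)) | (rsi & UINT64_C(0xff));
}

/* movw %si, %cx: replaces bits 0-15, keeps bits 16-63 of the old RCX. */
uint64_t movw_model(uint64_t old_rcx, uint64_t rsi)
{
    return (old_rcx & ~UINT64_C(0xffff)) | (rsi & UINT64_C(0xffff));
}

/* movl %esi, %ecx: writes bits 0-31 and zero-extends into bits 32-63,
 * so the result does not depend on the old RCX at all. */
uint64_t movl_model(uint64_t old_rcx, uint64_t rsi)
{
    (void)old_rcx; /* deliberately unused: no input dependency */
    return rsi & UINT64_C(0xffffffff);
}

/* Usage sketch: both forms deliver the same CL, but only the byte and
 * word forms need the old register value as an input. */
void partial_write_demo(void)
{
    uint64_t old = UINT64_C(0xAAAAAAAAAAAAAAAA);
    uint64_t rsi = UINT64_C(0x1122334455667788);
    assert(movb_model(old, rsi) == UINT64_C(0xAAAAAAAAAAAAAA88));
    assert(movl_model(old, rsi) == UINT64_C(0x55667788));
    assert((movb_model(old, rsi) & 0xff) == (movl_model(old, rsi) & 0xff));
}
```

The old_rcx parameter that movb_model and movw_model must read is the software-visible counterpart of the false dependency mentioned above: the hardware has to wait for, or later merge with, whatever last wrote RCX.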
Fun fact: even if the arg had been uint8_t, compilers would still hopefully use movl %esi, %ecx. You'd think reading a wider register when the arg value is only in SIL could create a partial-register stall, but an unofficial extension to the x86-64 System V calling convention is that callers should zero- or sign-extend narrow args to at least 32 bits. So we can assume the register was written with at least a 32-bit operation.
The specific downsides of some other choices:
- movq %rsi, %rcx - wastes a REX prefix (code-size downside).
- movb %sil, %cl - writes a partial register, and still needs a REX prefix to access SIL.
- movzbl %sil, %ecx - code size: 2-byte opcode, and needs a REX to read SIL. Also, AMD CPUs only do mov-elimination (zero latency) for movl/movq, not movzx.
- movw %si, %cx - zero advantages: needs an operand-size prefix and writes a partial register.
- movzwl %si, %ecx - tied with movq for code size, but defeats mov-elimination even on Intel CPUs.
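The code-size comparisons in this list can be checked against the machine-code bytes. The table below is a sketch: the struct and function names are hypothetical, and the bytes are one valid encoding of each instruction (an assembler may pick an equivalent alternative opcode for register-to-register moves, but the lengths stay the same).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One candidate instruction with a valid x86-64 encoding. */
struct candidate {
    const char *asm_text;    /* AT&T syntax, as written in the list above */
    unsigned char bytes[4];  /* encoding, padded with zeros */
    size_t length;           /* number of meaningful bytes */
};

static const struct candidate candidates[] = {
    { "movl %esi, %ecx",   { 0x89, 0xf1 },             2 }, /* no prefix */
    { "movq %rsi, %rcx",   { 0x48, 0x89, 0xf1 },       3 }, /* REX.W */
    { "movb %sil, %cl",    { 0x40, 0x88, 0xf1 },       3 }, /* REX selects SIL */
    { "movzbl %sil, %ecx", { 0x40, 0x0f, 0xb6, 0xce }, 4 }, /* REX + 2-byte opcode */
    { "movw %si, %cx",     { 0x66, 0x89, 0xf1 },       3 }, /* operand-size prefix */
    { "movzwl %si, %ecx",  { 0x0f, 0xb7, 0xce },       3 }, /* 2-byte opcode */
};

/* Returns the encoded length in bytes, or 0 if the text is not in the table. */
size_t encoded_length(const char *asm_text)
{
    assert(asm_text != NULL);
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (strcmp(candidates[i].asm_text, asm_text) == 0)
            return candidates[i].length;
    }
    return 0;
}
```

For example, encoded_length("movl %esi, %ecx") is 2, while every alternative in the list is 3 or 4 bytes, which matches the claim that the 32-bit form needs no prefix.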
Fun fact: if we pad with a dummy arg so that n arrives in RDX, GCC still chooses movl %edx, %ecx, even though movb %dl, %cl is the same code size (no REX needed to access DL). So yes, GCC is definitely avoiding byte operand-size.
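A padded variant of the function might look like the sketch below. This is an assumption about the shape of the experiment, not the exact source used for it: the name shift_left4_rightn_padded and the dummy parameter are hypothetical. Under the x86-64 System V convention the third integer argument is passed in RDX, which is what moves n out of RSI.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical variant: the unused `dummy` occupies the second argument
 * register (RSI), so n is passed in the third one (RDX). */
long shift_left4_rightn_padded(long x, long dummy, long n)
{
    (void)dummy;
    /* C leaves shifts by a negative or too-large count undefined. */
    assert(n >= 0 && n < (long)(sizeof(long) * CHAR_BIT));
    x <<= 4;
    x >>= n;
    return x;
}
```

Compiling this at -O1 and inspecting the instruction that copies n into ECX is one way to check the claim about movl %edx, %ecx on a given compiler version.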
Fun fact 2: Clang unfortunately does waste a REX on movq, missing this optimization. https://godbolt.org/z/6GWhMd
But if we make the count arg unsigned char, Clang and GCC both use movl instead of movb, fortunately. https://godbolt.org/z/e95WP8
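For reference, a version with an unsigned char count could be written as below. The function name is hypothetical and the code is a sketch of the variant described, not a copy of what is behind the Godbolt link.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical variant with a narrow count argument. The value arrives
 * in the low byte of ESI; per the unofficial convention described above,
 * callers are expected to have extended it to at least 32 bits. */
long shift_left4_rightn_u8(long x, unsigned char n)
{
    /* C leaves shifts by a too-large count undefined. */
    assert(n < sizeof(long) * CHAR_BIT);
    x <<= 4;
    x >>= n;
    return x;
}
```

The C-level result is the same as with a long count for any in-range n; only the register-copy instruction chosen by the compiler is of interest here.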