Why does GCC choose dword movl to copy a long shift count to CL?

Question

In the third chapter of Computer Systems: A Programmer's Perspective, an example program is given when talking about shift operations:

long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

Its assembly code is as follows (reproducible with GCC 10.2 -O1 for x86-64 on the Godbolt compiler explorer; -O2 schedules the instructions in a different order but still uses movl to ECX):

shift_left4_rightn:
endbr64
movq %rdi, %rax    # get x
salq $4, %rax      # x <<= 4
movl %esi, %ecx    # get n
sarq %cl, %rax     # x >>= n
ret

I wonder why the instruction that fetches n is movl %esi, %ecx instead of movq %rsi, %rcx, since n is a quadword.

On the other hand, movb %sil, %cl might be more suitable if optimization is the goal, since the shift amount uses only the single-byte register %cl and all higher bits are ignored.

As a result, I can't figure out the reason for using movl %esi, %ecx when dealing with a long integer.

Recommended answer

Yes, GCC realizes that the upper bits of the count are ignored by sar.
Given that, movl is the natural consequence of applying two simple optimization rules:

  • Avoid writing partial registers (i.e. 8-bit or 16-bit, where the write merges into the old value instead of zero-extending). See "Why doesn't GCC use partial registers?" for the reasons, which vary across microarchitectures; in this case they include a false dependency on the old value of RCX.
  • Prefer 32-bit operand size, because it is the default in x86-64 machine code and needs no prefixes. It is also at least as fast as any other operand size for any instruction.
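
The premise behind both rules, that sar ignores the upper bits of the count, can be modelled in portable C. This is only a sketch for illustration, not how GCC reasons internally: sar64_model and demo_count_masking are hypothetical names, and the model assumes the documented x86-64 behaviour that a 64-bit shift uses only the low 6 bits of CL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `sarq %cl, %rax`: with 64-bit operand size the CPU
   uses only the low 6 bits of CL as the shift count. */
static int64_t sar64_model(int64_t x, uint64_t count)
{
    unsigned c = (unsigned)(count & 63);      /* hardware masks the count */
    uint64_t shifted = (uint64_t)x >> c;      /* logical shift first */
    if (x < 0 && c != 0)
        shifted |= ~UINT64_C(0) << (64 - c);  /* fill vacated bits with the sign */
    /* Converting back to int64_t relies on the usual two's-complement
       behaviour of mainstream compilers. */
    return (int64_t)shifted;
}

/* Counts that differ only above bit 5 give the same result. */
static void demo_count_masking(void)
{
    int64_t x = -4096;
    assert(sar64_model(x, 3) == sar64_model(x, 0x100000003ULL)); /* bits 32+ differ */
    assert(sar64_model(x, 3) == sar64_model(x, 0x43));           /* bit 6 differs */
    assert(sar64_model(x, 3) == -512);
}
```

Calling demo_count_masking() from main should pass silently. Because bits above bit 5 never reach the result, copying only the low 32 bits of RSI (or only the low 8) loses nothing, so the choice between mov widths is purely about encoding size and partial-register effects.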

Fun fact: even if the arg had been uint8_t, compilers would still hopefully use movl %esi, %ecx. You might think that reading a wider register when the arg value is only in SIL could create a partial-register stall, but an unofficial extension to the x86-64 System V calling convention is that callers should zero-extend or sign-extend narrow args to at least 32 bits. So we can assume the register was written with at least a 32-bit operation.

The specific downsides of some other choices:

  • movq %rsi, %rcx - wastes a REX prefix (code-size downside).
  • movb %sil, %cl - writes a partial register, and still needs a REX prefix to access SIL.
  • movzbl %sil, %ecx - code size: 2-byte opcode, and needs a REX prefix to read SIL. Also, AMD CPUs only do mov-elimination (zero latency) for movl / movq, not movzx.
  • movw %si, %cx - zero advantages: needs an operand-size prefix and writes a partial register.
  • movzwl %si, %ecx - tied with movq for code size, but defeats mov-elimination even on Intel CPUs.
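
The code-size comparison in this list can be made concrete with a small table. This is a sketch under stated assumptions: the byte sequences below were assembled by hand from the standard x86-64 encoding rules and are worth confirming with a disassembler, and the names struct encoding, candidates, encoded_length and demo_movl_is_shortest are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One candidate instruction and its machine-code bytes. */
struct encoding {
    const char *text;        /* AT&T-syntax instruction */
    unsigned char bytes[4];  /* machine-code bytes, zero padded */
    size_t length;           /* number of bytes actually used */
};

/* Hand-assembled encodings (an assumption to verify, e.g. with objdump -d). */
static const struct encoding candidates[] = {
    { "movl %esi, %ecx",   { 0x89, 0xF1 },             2 }, /* no prefix needed */
    { "movq %rsi, %rcx",   { 0x48, 0x89, 0xF1 },       3 }, /* REX.W prefix */
    { "movb %sil, %cl",    { 0x40, 0x88, 0xF1 },       3 }, /* REX to reach SIL */
    { "movzbl %sil, %ecx", { 0x40, 0x0F, 0xB6, 0xCE }, 4 }, /* REX + 2-byte opcode */
    { "movw %si, %cx",     { 0x66, 0x89, 0xF1 },       3 }, /* operand-size prefix */
    { "movzwl %si, %ecx",  { 0x0F, 0xB7, 0xCE },       3 }, /* 2-byte opcode */
};

#define NUM_CANDIDATES (sizeof candidates / sizeof candidates[0])

/* Return the encoded length of a candidate, or 0 if it is not in the table. */
static size_t encoded_length(const char *text)
{
    for (size_t i = 0; i < NUM_CANDIDATES; i++)
        if (strcmp(candidates[i].text, text) == 0)
            return candidates[i].length;
    return 0;
}

/* Check that every alternative is strictly longer than movl (entry 0). */
static void demo_movl_is_shortest(void)
{
    size_t movl_len = encoded_length("movl %esi, %ecx");
    for (size_t i = 1; i < NUM_CANDIDATES; i++)
        assert(candidates[i].length > movl_len);
}
```

Under these encodings movl is the only 2-byte option when the count arrives in RSI; every alternative pays for a REX prefix, an operand-size prefix, or a 2-byte opcode.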

Fun fact: if we pad with a dummy arg so that n arrives in RDX, GCC still chooses movl %edx, %ecx, even though movb %dl, %cl is the same code size (no REX prefix is needed to access DL). So yes, GCC is definitely avoiding byte operand size.

Fun fact 2: Clang unfortunately does waste a REX prefix on movq, missing this optimization. https://godbolt.org/z/6GWhMd

But if we make the count arg unsigned char, Clang and GCC fortunately both use movl instead of movb. https://godbolt.org/z/e95WP8
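
For readers who want to reproduce the two variants from the fun facts, the source might look like the sketch below. The function names and the dummy parameter are hypothetical; the Godbolt links above hold the code that was actually compiled.

```c
#include <assert.h>

/* Hypothetical variant 1: an unused middle parameter moves n from RSI to RDX
   (x86-64 System V passes the first three integer args in RDI, RSI, RDX). */
long shift_padded(long x, long dummy, long n)
{
    (void)dummy;   /* present only to push n into the third arg register */
    x <<= 4;
    x >>= n;
    return x;
}

/* Hypothetical variant 2: the shift count is a single byte. */
long shift_byte_count(long x, unsigned char n)
{
    x <<= 4;
    x >>= n;
    return x;
}

/* Both variants compute the same value as the original function. */
void demo_variants(void)
{
    assert(shift_padded(3, 0, 2) == 12);   /* (3 << 4) >> 2 */
    assert(shift_byte_count(3, 2) == 12);
}
```

Compiling this with gcc -O1 -S and reading the generated assembly is one way to check which mov form a given compiler version picks for each variant.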


09-17 16:28