Question
Summary: I was looking at assembly code to guide my optimizations and saw lots of sign or zero extensions when adding an int32 to a pointer.
void Test(int *out, int offset)
{
    out[offset] = 1;
}
-------------------------------------
    movslq  %esi, %rsi
    movl    $1, (%rdi,%rsi,4)
    ret
At first, I thought my compiler was challenged at adding 32bit to 64bit integers, but I've confirmed this behavior with Intel ICC 11, ICC 14, and GCC 5.3.
This thread confirms my findings, but it's not clear if the sign or zero extension is necessary. This sign/zero extension would only be necessary if the upper 32bits aren't already set. But wouldn't the x86-64 ABI be smart enough to require that?
I'm kind of reluctant to change all my pointer offsets to ssize_t because register spills will increase the cache footprint of the code.
Recommended answer
Yes, you have to assume that the high 32 bits of an arg or return-value register contain garbage. On the flip side, you are allowed to leave garbage in the high 32 when calling or returning yourself. i.e. the burden is on the receiving side to ignore the high bits, not on the passing side to clean the high bits.
You need to sign or zero extend to 64 bits to use the value in a 64-bit effective address. In the x32 ABI, gcc frequently uses 32-bit effective addresses instead of using 64-bit operand-size for every instruction modifying a potentially-negative integer used as an array index.
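As a rough illustration of that point, here is a minimal C sketch contrasting a 32-bit index with a pointer-width one. The function names `store_int_index` and `store_wide_index` are made up for this example, and the comments describe typical x86-64 code generation rather than anything the C standard guarantees.

    #include <stddef.h>

    /* 32-bit signed index: the callee has to sign-extend it (e.g. movslq)
       before it can appear in a 64-bit addressing mode, because the upper
       half of the argument register is unspecified on entry. */
    void store_int_index(int *out, int offset)
    {
        out[offset] = 1;
    }

    /* Pointer-width index: the whole register is meaningful already, so it
       can typically be used in the addressing mode without extra work. */
    void store_wide_index(int *out, ptrdiff_t offset)
    {
        out[offset] = 1;
    }

Both versions behave identically at the C level, including for negative offsets; only the generated instructions are expected to differ.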
The x86-64 SysV ABI only says anything about which parts of a register are zeroed for _Bool (aka bool). Page 20:
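"When a value of type _Bool is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero (footnote 14: other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit)."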
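A small C sketch of what that rule means in practice, assuming a C99 compiler. The helper names `to_bool` and `low_byte_of_bool` are hypothetical and only model the "truncate to 8 bits" view of the consumer.

    #include <stdbool.h>

    /* Conversion to _Bool yields exactly 0 or 1, so bit 0 carries the truth
       value and bits 1..7 of the low byte are zero. */
    bool to_bool(int x)
    {
        return x;   /* any nonzero value becomes 1 */
    }

    /* Models a consumer that looks only at the low 8 bits of the register. */
    unsigned char low_byte_of_bool(int x)
    {
        return (unsigned char)to_bool(x);
    }

Whatever ends up above bit 7 in the register, a consumer that truncates to 8 bits sees only 0 or 1.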
Also, the stuff about %al holding the number of FP register args for varargs functions, not the whole %rax.
There's an open github issue about this exact question on the github page for the x32 and x86-64 ABI documents.
The ABI doesn't place any further requirements or guarantees on the contents of the high parts of integer or vector registers holding args or return values, so there aren't any. I have confirmation of this fact via email from Michael Matz (one of the ABI maintainers): "Generally, if the ABI doesn't say something is specified, you cannot rely on it."
He also confirmed that e.g. clang >= 3.6's use of an addps that could slow down or raise extra FP exceptions with garbage in high elements is a bug (which reminds me I should report that). He adds that this was an issue once with an AMD implementation of a glibc math function. Normal C code can leave garbage in high elements of vector regs when passing scalar double or float args.
Narrow function arguments, even _Bool/bool, are sign or zero-extended to 32 bits. clang even makes code that depends on this behaviour (since 2007, apparently). ICC17 doesn't do it, so ICC and clang are not ABI-compatible, even for C. Don't call clang-compiled functions from ICC-compiled code for the x86-64 SysV ABI, if any of the first 6 integer args are narrower than 32-bit.
This doesn't apply to return values, only args: gcc and clang both assume that return values they receive only have valid data up to the width of the type. For example, gcc will make functions returning char that leave garbage in the high 24 bits of %eax.
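To make the caller's side concrete, here is a hedged C model of reading a char return value out of a register whose upper 24 bits are garbage. The function `read_char_return` is invented for illustration; it mimics what a movsbl of the low byte would compute and is not code any compiler emits verbatim.

    #include <stdint.h>

    /* `eax` stands in for the raw 32-bit register contents after the call.
       Only the low 8 bits are trusted; they are sign-extended by hand so the
       result does not depend on implementation-defined conversions. */
    int32_t read_char_return(uint32_t eax)
    {
        int32_t low = (int32_t)(eax & 0xFFu);
        return low >= 0x80 ? low - 0x100 : low;
    }

For instance, read_char_return(0xDEADBE80u) gives -128: the 0xDEADBE part is ignored and the low byte 0x80 is sign-extended.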
A recent thread on the ABI discussion group was a proposal to clarify the rules for extending 8 and 16-bit args to 32 bits, and maybe actually modify the ABI to require this. The major compilers (except ICC) already do it, but it would be a change to the contract between callers and callees.
Here's an example (check it out with other compilers or tweak the code on the Godbolt Compiler Explorer, where I've included many simple examples that each demonstrate only one piece of the puzzle, as well as this one, which demonstrates a lot):
extern short fshort(short a);
extern unsigned fuint(unsigned int a);
extern unsigned short array_us[];
unsigned short lookupu(unsigned short a) {
    unsigned int a_int = a + 1234;
    a_int += fshort(a);                 // NOTE: not the same calls as the signed lookup
    return array_us[a + fuint(a_int)];
}
# clang-3.8 -O3 for x86-64.  arg in %rdi.  (Actually in %di, zero-extended to %edi by our caller)
lookupu(unsigned short):
    pushq   %rbx                        # save a call-preserved reg for our own use.  (Also aligns the stack for another call)
    movl    %edi, %ebx                  # If we didn't assume our arg was already zero-extended, this would be a movzwl (aka movzx)
    movswl  %bx, %edi                   # sign-extend to call a function that takes signed short instead of unsigned short.
    callq   fshort(short)
    cwtl                                # Don't trust the upper bits of the return value.  (This is cwde in Intel syntax: eax = sign_extend(ax))
    leal    1234(%rbx,%rax), %edi       # this is the point where we'd get a wrong answer if our arg wasn't zero-extended.  gcc doesn't assume this, but clang does.
    callq   fuint(unsigned int)
    addl    %ebx, %eax                  # zero-extends eax to 64 bits
    movzwl  array_us(%rax,%rax), %eax   # This zero-extension (instead of just writing ax) is *not* for correctness, just for performance: avoid partial-register slowdowns if the caller reads eax
    popq    %rbx
    retq
Note: movzwl array_us(,%rax,2) would be equivalent, but no smaller. If we could depend on the high bits of %rax being zeroed in fuint()'s return value, the compiler could have used array_us(%rbx, %rax, 2) instead of using the add insn.
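The hazard at the leal in the listing above can be modelled in plain C. This is only a sketch: `reg` stands for whatever 32-bit value is in %edi on entry, `ret` stands for the sign-extended fshort() result, and both function names are hypothetical.

    #include <stdint.h>

    /* clang-style: trusts that the caller zero-extended the 16-bit arg,
       so the full 32-bit register value goes straight into the sum. */
    uint32_t a_int_trusting(uint32_t reg, int32_t ret)
    {
        return 1234u + reg + (uint32_t)ret;
    }

    /* Careful style: re-extends the low 16 bits itself (what a movzwl
       would do) before using the value. */
    uint32_t a_int_careful(uint32_t reg, int32_t ret)
    {
        return 1234u + (reg & 0xFFFFu) + (uint32_t)ret;
    }

With a properly extended argument (reg = 5) the two agree. With garbage above bit 15 (reg = 0xABCD0005) only the careful version still yields 1249 for ret = 10, which is why mixing callers that don't extend with callees that assume extension breaks.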
Leaving the high32 undefined is intentional, and I think it's a good design decision.
Ignoring the high 32 is free when doing 32-bit ops. A 32-bit operation zero-extends its result to 64-bit for free, so you only need an extra mov edx, edi or something if you could have used the reg directly in a 64-bit addressing mode or 64-bit operation.
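The same behaviour can be written out in C as a model of the hardware rule. The function `add32_then_widen` is a made-up name, and the 64-bit parameters simply stand in for full registers whose upper halves may hold anything.

    #include <stdint.h>

    /* Models `addl %esi, %edi` followed by a 64-bit use of %rdi: the sum
       wraps modulo 2^32 and the upper 32 bits of the result are zero, no
       matter what the upper halves of the inputs contained. */
    uint64_t add32_then_widen(uint64_t rdi, uint64_t rsi)
    {
        uint32_t sum = (uint32_t)rdi + (uint32_t)rsi;
        return (uint64_t)sum;
    }

So garbage in the high halves of both inputs never reaches the result, which is what makes ignoring it free for 32-bit arithmetic.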
Some functions won't save any insns from having their args already extended to 64-bit, so it's a potential waste for callers to always have to do it. Some functions use their args in a way that requires the opposite extension from the signedness of the arg, so leaving it up to the callee to decide what to do works well.
Zero-extending to 64-bit regardless of signedness would be free for most callers, though, and might have been a good ABI design choice. Since arg regs are clobbered anyway, the caller already needs to do something extra if it wants to keep a full 64-bit value across a call where it only passes the low 32. Thus it usually only costs extra when you need a 64-bit result for something before the call, and then pass a truncated version to a function. In x86-64 SysV, you can generate your result in RDI and use it, and then call foo which will only look at EDI.
16-bit and 8-bit operand-sizes often lead to false dependencies (AMD, P4, or Silvermont, and later SnB-family), or partial-register stalls (pre SnB) or minor slowdowns (Sandybridge), so the undocumented behaviour of requiring 8 and 16b types to be extended to 32b for arg-passing makes some sense. See Why doesn't GCC use partial registers? for more details on those microarchitectures.
This is probably not a big deal for code-size in real code, since tiny functions are / should be static inline, and arg-handling insns are a small part of bigger functions. Inter-procedural optimization can remove overhead between calls when the compiler can see both definitions, even without inlining. (IDK how well compilers do at this in practice.)
I'm not sure whether changing function signatures to use uintptr_t will help or hurt overall performance with 64-bit pointers. I wouldn't worry about stack space for scalars. In most functions, the compiler pushes/pops enough call-preserved registers (like %rbx and %rbp) to keep its own variables live in registers. A tiny bit extra space for 8B spills instead of 4B is negligible.
As far as code-size, working with 64-bit values requires a REX prefix on some insns that wouldn't have otherwise needed one. Zero-extending to 64-bit happens for free if any operations are required on a 32-bit value before it gets used as an array index. Sign-extension always takes an extra instruction if it's required. But compilers can sign-extend and work with it as a 64-bit signed value from the start to save instructions, at the cost of needing more REX prefixes. (Signed overflow is UB, not defined to wrap around, so compilers can often avoid redoing sign-extension inside a loop with an int i that uses arr[i].)
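A sketch of the loop case mentioned in the parenthetical, with two hypothetical functions. The comments describe what optimizers are generally permitted to do, not what any particular compiler version is guaranteed to emit.

    #include <stddef.h>

    /* int index: overflow of `i` would be undefined behaviour, so the
       compiler may keep one 64-bit induction variable instead of
       sign-extending `i` on every iteration. */
    long sum_int_index(const int *arr, int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += arr[i];
        return total;
    }

    /* Pointer-width index: no extension is needed in the first place, at
       the cost of 64-bit operand size (REX prefixes) on the counter. */
    long sum_wide_index(const int *arr, ptrdiff_t n)
    {
        long total = 0;
        for (ptrdiff_t i = 0; i < n; i++)
            total += arr[i];
        return total;
    }

Comparing the compiler output for the two is a quick way to check whether switching index types buys anything in a given hot loop.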
Modern CPUs usually care more about insn count than insn size, within reason. Hot code will often be running from the uop cache in CPUs that have them. Still, smaller code can improve density in the uop cache. If you can save code size without using more or slower insns, then it's a win, but not usually worth sacrificing anything else for unless it's a lot of code size.
Like maybe one extra LEA instruction to allow [reg + disp8] addressing for a dozen later instructions, instead of disp32. Or xor eax,eax before multiple mov [rdi+n], 0 instructions to replace the imm32=0 with a register source. (Especially if that allows micro-fusion where it wouldn't be possible with a RIP-relative + immediate, because what really matters is front-end uop count, not instruction count.)