问题描述
我在 C++ 代码的内部循环中使用了 128 位整数计数器.(不相关的背景:实际应用是在规则网格上计算有限差分方程,其中涉及重复递增大整数,即使是64位也不够精度,因为小舍入累积到足以影响答案.)
I'm using a 128 bit integer counter in the very inner loops of my C++ code. (Irrelevant background: The actual application is evaluating finite difference equations on a regular grid, which involves repetitively incrementing large integers, and even 64 bits isn't enough precision because small rounding accumulates enough to affect the answers.)
我已将整数表示为两个 64 位无符号长整型.我现在需要将这些值增加 128 位常量.这并不难,但您必须手动捕捉从低位字到高位字的进位.
I've represented the integer as two 64 bit unsigned longs. I now need to increment those values by a 128 bit constant. This isn't hard, but you have to manually catch the carry from the low word to the high word.
我有这样的工作代码:
inline void increment128(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
if (loWord < loAdd) ++hiWord; // test_and_add_carry
hiWord += hiAdd;
}
这是紧凑而简单的代码.它有效.
This is tight and simple code. It works.
不幸的是,这大约是我运行时间的 20%.杀手线是 loWord 测试.如果我删除它,我显然会得到错误的答案,但运行时开销从 20% 下降到 4%!所以那个carry test特别贵!
Unfortunately this is about 20% of my runtime. The killer line is that loWord test. If I remove it, I obviously get the wrong answers but the runtime overhead drops from 20% to 4%! So that carry test is especially expensive!
我的问题:C++ 是否公开硬件进位标志,甚至作为 GCC 的扩展?如果实际编译指令使用加法使用最后进位指令进行 hiWord 加法,则似乎可以在没有上面的 test-and-add-carry 行的情况下完成加法.有没有办法重写 test-and-add-carry 行来让编译器使用内部操作码?
My question: Does C++ expose the hardware carry flags, even as an extension to GCC?It seems like the additions could be done without the test-and-add-carry line above if the actual compiled instructions used an add using last carry instruction for the hiWord addition.Is there a way to rewrite the test-and-add-carry line to get the compiler to use the intrinsic opcode?
推荐答案
其实如果你仔细写代码,gcc 会自动使用进位...
Actually gcc will use the carry automatically if you write your code carefully...
当前 GCC 可以将 hiWord += (loWord < loAdd);
优化为 add
/adc
(x86 的 add-with-carry).此优化是在 GCC5.3 中引入的.
Current GCC can optimize hiWord += (loWord < loAdd);
into add
/adc
(x86's add-with-carry). This optimization was introduced in GCC5.3.
- 在 64 位模式下使用单独的
uint64_t
块:https://godbolt.org/z/S2kGRz. - 在带有
uint32_t
块的 32 位模式下,同样的事情:https://godbolt.org/z/9FC9vc
- With separate
uint64_t
chunks in 64-bit mode: https://godbolt.org/z/S2kGRz. - And the same thing in 32-bit mode with
uint32_t
chunks: https://godbolt.org/z/9FC9vc
(编者注:当然,困难的部分是编写一个正确带有进位和进位的全加器;这在 C 中很难,GCC 不知道如何优化任何我'见过.)
(editor's note: Of course the hard part is writing a correct full-adder with carry in and carry out; that's hard in C and GCC doesn't know how to optimize any that I've seen.)
还相关:https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html 可以为您提供未签名或签名溢出检测的结果.
Also related: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html can give you carry-out from unsigned, or signed-overflow detection.
较旧的 GCC,如 GCC4.5,将在添加的进位输出上进行分支或 setc
,而不是使用 adc
,并且仅使用 adc
(add-with-carry) 在 add
的标志结果上,如果你使用了 __int128
.(或 uint64_t
在 32 位目标上).请参阅 gcc 中是否有 128 位整数? - 仅在 64 位目标上,自 GCC4.1 起支持.
Older GCC, like GCC4.5, will branch or setc
on the carry-out from an add, instead of using adc
, and only used adc
(add-with-carry) on the flag-result from an add
if you used __int128
. (Or uint64_t
on a 32-bit target). See Is there a 128 bit integer in gcc? - only on 64-bit targets, supported since GCC4.1.
我用 gcc -O2 -Wall -Werror -S
编译了这段代码:
I compiled this code with gcc -O2 -Wall -Werror -S
:
void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
if (loWord < loAdd) ++hiWord; // test_and_add_carry
hiWord += hiAdd;
}
void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
hiWord += hiAdd;
hiWord += (loWord < loAdd); // test_and_add_carry
}
这是 increment128_1 的程序集:
This is the assembly for increment128_1:
.cfi_startproc
movabsq $-8801131483544218438, %rax
addq (%rsi), %rax
movabsq $-8801131483544218439, %rdx
cmpq %rdx, %rax
movq %rax, (%rsi)
ja .L5
movq (%rdi), %rax
addq $1, %rax
.L3:
movabsq $6794178679361, %rdx
addq %rdx, %rax
movq %rax, (%rdi)
ret
...这是 increment128_2 的程序集:
...and this is the assembly for increment128_2:
movabsq $-8801131483544218438, %rax
addq %rax, (%rsi)
movabsq $6794178679361, %rax
addq (%rdi), %rax
movabsq $-8801131483544218439, %rdx
movq %rax, (%rdi)
cmpq %rdx, (%rsi)
setbe %dl
movzbl %dl, %edx
leaq (%rdx,%rax), %rax
movq %rax, (%rdi)
ret
注意第二个版本中缺少条件分支.
Note the lack of conditional branches in the second version.
此外,引用通常对性能不利,因为 GCC 不得不担心别名...通常按值传递更好.考虑:
Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:
struct my_uint128_t {
unsigned long hi;
unsigned long lo;
};
my_uint128_t increment128_3(my_uint128_t x)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
x.lo += loAdd;
x.hi += hiAdd + (x.lo < loAdd);
return x;
}
组装:
.cfi_startproc
movabsq $-8801131483544218438, %rdx
movabsq $-8801131483544218439, %rax
movabsq $6794178679362, %rcx
addq %rsi, %rdx
cmpq %rdx, %rax
sbbq %rax, %rax
addq %rcx, %rax
addq %rdi, %rax
ret
这实际上是三者中最严密的代码.
This is actually the tightest code of the three.
...好吧,所以他们中没有人真正自动使用进位:-).但是他们确实避免了条件分支,我敢打赌这是缓慢的部分(因为分支预测逻辑会出错一半).
...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).
还有一个,我在搜索时偶然发现.您知道 GCC 内置了对 128 位整数的支持吗?
And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?
typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));
my_uint128_t increment128_4(my_uint128_t x)
{
const my_uint128_t hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
return x + (hiAdd << 64) + loAdd;
}
这个组件的性能和它得到的一样好:
The assembly for this one is about as good as it gets:
.cfi_startproc
movabsq $-8801131483544218438, %rax
movabsq $6794178679361, %rdx
pushq %rbx
.cfi_def_cfa_offset 16
addq %rdi, %rax
adcq %rsi, %rdx
popq %rbx
.cfi_offset 3, -16
.cfi_def_cfa_offset 8
ret
(不知道ebx
的push/pop是从哪里来的,不过这个还是不错的.)
(Not sure where the push/pop of ebx
came from, but this is still not bad.)
顺便说一下,所有这些都在 GCC 4.5.2 中.
All of these are with GCC 4.5.2, by the way.
这篇关于使用进位标志的高效 128 位加法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!