问题描述
GCC 有 128 位整数.使用这些,我可以让编译器使用 mul
(或只有一个操作数的 imul
)指令.例如
GCC has 128-bit integers. Using these I can get the compiler to use the mul
(or imul
with only one operand) instructions. For example
uint64_t x,y;
unsigned __int128 z = (unsigned __int128)x*y;
产生mul
.我已经用它创建了一个 128x128 到 256 的函数(如果您有兴趣,请在更新之前查看此问题的结尾,了解相关代码).
produces mul
. I have used this to create a 128x128 to 256 function (see the end of this question, before the update, for code for that if you're interested).
现在我想做 256 位加法,除了使用汇编之外,我还没有找到让编译器使用 ADC
的方法.我可以使用汇编程序,但我想要内联函数以提高效率.编译器已经生成了一个高效的 128x128 到 256 函数(原因我在这个问题的开头解释过)所以我不明白为什么我也应该在汇编中重写它(或编译器已经有效实现的任何其他函数).
Now I want to do 256-bit addition and I have not found a way to get the compiler to use ADC
except by using assembly. I could use an assembler but I want inline functions for efficiency. The compiler already produces an efficient 128x128 to 256 function (for the reason I explained at the start of this question) so I don't see why I should rewrite this in assembly as well (or any other functions which the compiler already implements efficiently).
这是我想出的内联汇编函数:
Here is the inline assembly function I have come up with:
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4)
__asm__ __volatile__ (
"addq %[v1], %[u1]
"
"adcq %[v2], %[u2]
"
"adcq %[v3], %[u3]
"
"adcq %[v4], %[u4]
"
: [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4)
: [v1] "r" (Y1), [v2] "r" (Y2), [v3] "r" (Y3), [v4] "r" (Y4))
(可能不是每个输出都需要早期的clobber修饰符,但我得到了至少没有最后两个的错误结果).(编者注:在读取所有输入之前不会写入最后一个输出,并且不声明为 early-clobber 是安全的.)
(probably not every output needs a early clobber modifier but I get the wrong result without at least the last two). (editor's note: the last output isn't written until all inputs have been read, and would be safe to not declare as early-clobber.)
这是一个在 C 中做同样事情的函数
And here is a function which does the same thing in C
void add256(int256 *x, int256 *y) {
uint64_t t1, t2;
t1 = x->x1; x->x1 += y->x1;
t2 = x->x2; x->x2 += y->x2 + ((x->x1) < t1);
t1 = x->x3; x->x3 += y->x3 + ((x->x2) < t2);
x->x4 += y->x4 + ((x->x3) < t1);
}
为什么需要组装?为什么编译器不能编译add256
函数来使用进位标志?有没有办法强制编译器执行此操作(例如,我可以更改 add256
以便执行此操作吗)?有人想为不支持内联汇编的编译器 (在汇编中编写所有函数?)为什么没有内在函数?
Why is assembly necessary for this? Why can't the compiler compile the add256
function to use the carry flags? Is there a way to coerce the compiler to do this (e.g. can I change add256
so that it does this)? What is someone suppose to do for a compiler which does not support inline assembly (write all the functions in assembly?) Why are there no intrinsic for this?
这是 128x128 到 256 的函数
Here is the 128x128 to 256 function
void muldwu128(int256 *w, uint128 u, uint128 v) {
uint128 t;
uint64_t u0, u1, v0, v1, k, w1, w2, w3;
u0 = u >> 64L;
u1 = u;
v0 = v >> 64L;
v1 = v;
t = (uint128)u1*v1;
w3 = t;
k = t >> 64L;
t = (uint128)u0*v1 + k;
w2 = t;
w1 = t >> 64L;
t = (uint128)u1*v0 + w2;
k = t >> 64L;
w->hi = (uint128)u0*v0 + w1 + k;
w->lo = (t << 64L) + w3;
}
某些类型定义:
typedef __int128 int128;
typedef unsigned __int128 uint128;
typedef union {
struct {
uint64_t x1;
uint64_t x2;
int64_t x3;
int64_t x4;
};
struct {
uint128 lo;
int128 hi;
};
} int256;
更新:
我的问题主要是这些问题的重复:
My question is largely a duplicate of these questions:
- get-gcc-to-use-carry-logic-for-arbitrary-precision-arithmetic-without-inline-assembly
- efficient-128-bit-addition-using-carry-flag
- multiword-addition-in-c.
英特尔有一篇很好的文章(新指令支持大整数运算),讨论了大整数运算和三个新指令 MULX、ADCX、ADOX.他们写道:
Intel has a good article (New Instructions Support Large Integer Arithmetic) which discusses large integer arithmetic and the three new instructions MULX, ADCX, ADOX. They write:
mulx 的内在定义,adcx 和adox 也将被集成到编译器中.这是第一个带进位相加"类型指令的示例使用内在因素.内在的支持将使用户能够实现大型使用高级编程语言的整数运算,例如C/C++.
内在函数是
unsigned __int64 umul128(unsigned __int64 a, unsigned __int64 b, unsigned __int64 * hi);
unsigned char _addcarry_u64(unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 *out);
unsigned char _addcarryx_u64(unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 *out);
顺便说一句,MSVC 已经有一个 _umul128
内在的.因此,即使 MSVC 没有 __int128
,_umul128
内在函数也可用于生成 mul
并因此生成 128 位乘法.
Incidentally, MSVC already has a _umul128
intrinsic. So even though MSVC does not have __int128
the _umul128
intrinsic can be used to generate mul
and therefore 128 bit multiplication.
MULX
在 BMI2 (Haswell) 中.ADCX
和 ADOX
指令自 Broadwell 起可用,作为 ADX 扩展.太糟糕了,自 1979 年的 8086 以来,ADC
没有内在函数.这将解决内联汇编问题.
MULX
is in BMI2 (Haswell). The ADCX
and ADOX
instructions are available since Broadwell, as the ADX extension. It's too bad there is no intrinsic for ADC
which has been available since the 8086 in 1979. That would solve the inline assembly problem.
(编者注:英特尔的内在指南确实定义了_addcarry_u64
对于基线 x86-64,但也许并非所有编译器都实现了它.但是,gcc 通常会低效地编译它和/或 _addcarryx
,经常使用 setc
将 CF 溢出到一个整数,而不是更好地对指令进行排序.)
(Editor's note: Intel's intrinsics guide does define _addcarry_u64
for baseline x86-64, but perhaps not all compilers implemented it. However, gcc typically compiles it and/or _addcarryx
inefficiently, often spilling CF to an integer with setc
instead of ordering instructions better.)
GCC 的 __int128
codegen will 使用 mulx
如果 BMI2 已启用(例如使用 -mbmi2
或 -march=haswell
).
GCC's __int128
codegen will use mulx
if BMI2 is enabled (e.g. using -mbmi2
or -march=haswell
).
我按照 Lưu Vĩnh Phúc 的建议尝试了带有内置进位的 Clang 添加
I tried the Clang's add with carry builtins as suggested by Lưu Vĩnh Phúc
void add256(int256 *x, int256 *y) {
unsigned long long carryin=0, carryout;
x->x1 = __builtin_addcll(x->x1, y->x1, carryin, &carryout); carryin = carryout;
x->x2 = __builtin_addcll(x->x2, y->x2, carryin, &carryout); carryin = carryout;
x->x3 = __builtin_addcll(x->x3, y->x3, carryin, &carryout); carryin = carryout;
x->x4 = __builtin_addcll(x->x4, y->x4, carryin, &carryout);
}
但这不会生成ADC
,而且比我预期的要复杂.
but this does not generated ADC
and it's more complicated than I expect.
推荐答案
我找到了使用 _addcarry_u64
内在的 ICC 13.0.01 的解决方案
I found a solution with ICC 13.0.01 using the _addcarry_u64
intrinsic
void add256(uint256 *x, uint256 *y) {
unsigned char c = 0;
c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
_addcarry_u64(c, x->x4, y->x4, &x->x4);
}
生产
L__routine_start_add256_0:
add256:
xorl %r9d, %r9d #25.9
movq (%rsi), %rax #22.9
addq %rax, (%rdi) #22.9
movq 8(%rsi), %rdx #23.9
adcq %rdx, 8(%rdi) #23.9
movq 16(%rsi), %rcx #24.9
adcq %rcx, 16(%rdi) #24.9
movq 24(%rsi), %r8 #25.9
adcq %r8, 24(%rdi) #25.9
setb %r9b #25.9
ret #26.1
我用 -O3
编译.我不知道如何使用 ICC 启用 adx
.也许我需要 ICC 14?
I compiled with -O3
. I don't know how to enable adx
with ICC. Maybe I need ICC 14?
这正是我期望的 1 个 addq
和三个 adcq
.
That's exactly 1 addq
and three adcq
like I expect.
对于 Clang,使用 -O3 -madx
的结果是一团糟
With Clang the result using -O3 -madx
is a mess
add256(uint256*, uint256*): # @add256(uint256*, uint256*)
movq (%rsi), %rax
xorl %ecx, %ecx
xorl %edx, %edx
addb $-1, %dl
adcq %rax, (%rdi)
addb $-1, %cl
movq (%rdi), %rcx
adcxq %rax, %rcx
setb %al
movq 8(%rsi), %rcx
movb %al, %dl
addb $-1, %dl
adcq %rcx, 8(%rdi)
addb $-1, %al
movq 8(%rdi), %rax
adcxq %rcx, %rax
setb %al
movq 16(%rsi), %rcx
movb %al, %dl
addb $-1, %dl
adcq %rcx, 16(%rdi)
addb $-1, %al
movq 16(%rdi), %rax
adcxq %rcx, %rax
setb %al
movq 24(%rsi), %rcx
addb $-1, %al
adcq %rcx, 24(%rdi)
retq
如果没有在 Clang 中启用 -madx
,结果也好不到哪里去.
Without enabling -madx
in Clang the result is not much better.
显然 MSVC 已经有 _addcarry_u64
.我试过了,它和 ICC 一样好(1x add
和 3x adc
).
这篇关于使用进位标志的多字加法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!