Problem description
I have two 128-bit numbers in memory, shown here in hexadecimal (little-endian byte order):
x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
I have to perform an unsigned multiplication of these two numbers, so my new number will be:
z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Now, I'm aware that I can move one half of x and one half of y into the rax and rbx registers and, for example, do the mul operation, then do the same with the other halves. The problem is that by doing so I lose the carry-over, and I have no idea how to avoid that. I've been stuck on this problem for about 4 hours, and the only solution I can see is converting to binary (and <-> shl,1).
Can you give me some input on this problem?
I think the best solution is to take one byte at a time.
Recommended answer
As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.
__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }
compiles to (gcc 6.2 -O3 on Godbolt):
imul rsi, rdx # tmp94, b
mov rax, rdi # tmp93, a
imul rcx, rdi # tmp95, a
mul rdx # b
add rcx, rsi # tmp96, tmp94
add rdx, rcx #, tmp96
ret
Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.
Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.
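To connect that asm back to the carry problem in the question, here is a minimal C sketch of the same decomposition. The names mul128_sketch and mul128_sketch_check are made up for illustration, and the code assumes a GNU C compiler (gcc or clang) on a 64-bit target so that unsigned __int128 is available. The widening multiply of the two low halves is what the one-operand MUL does: its upper 64 bits (RDX) are exactly the carry-over into the high half, so nothing is lost.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;  /* GNU C extension, 64-bit targets only */

/* Hypothetical helper: low 128 bits of a * b, built from 64-bit halves. */
static u128 mul128_sketch(u128 a, u128 b)
{
    uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
    uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);

    /* Full 64x64 -> 128 product of the low halves (one-operand MUL).
       The upper 64 bits of this product are the carry into the high half. */
    u128 low = (u128)a_lo * b_lo;

    /* Cross terms: only their low 64 bits fall inside the 128-bit result,
       so truncating 64-bit multiplies (two-operand IMUL) are enough. */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;

    /* a_hi * b_hi would start at bit 128, so it is dropped entirely. */
    return low + ((u128)cross << 64);
}

/* Self-check against the compiler's native 128-bit multiply. */
static void mul128_sketch_check(void)
{
    u128 a = ((u128)0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
    u128 b = ((u128)0x0fedcba987654321ULL << 64) | 0x1122334455667788ULL;
    assert(mul128_sketch(a, b) == a * b);
}
```

This only produces the low 128 bits of the product, matching what the compiler output above computes; a full 256-bit result would also need the high halves of the cross terms and the a_hi * b_hi product.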
With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.
Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.
Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple of parts in inline asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).
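As a rough illustration of the built-in route, the sketch below loads the byte sequences from the question, multiplies them with the plain * operator, and compares the result with the expected z. The helper names load128, store128 and question_example_check are hypothetical, and the memcpy-based load assumes a little-endian host (true on x86-64), which matches the byte order given in the question.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: little-endian host, so copying the 16 bytes as-is
   yields the intended 128-bit value. */
static __uint128_t load128(const unsigned char bytes[16])
{
    __uint128_t v;
    memcpy(&v, bytes, sizeof v);
    return v;
}

static void store128(unsigned char bytes[16], __uint128_t v)
{
    memcpy(bytes, &v, sizeof v);
}

/* Reproduces the x, y and z values from the question. */
static void question_example_check(void)
{
    /* Trailing bytes are zero-initialized. */
    const unsigned char x[16] = {0x12, 0x45};
    const unsigned char y[16] = {0x36, 0xa1};
    const unsigned char expected_z[16] = {0xcc, 0xe3, 0x7e, 0x2b};
    unsigned char z[16];

    /* The compiler expands this into the MUL/IMUL sequence itself. */
    store128(z, load128(x) * load128(y));
    assert(memcmp(z, expected_z, sizeof z) == 0);
}
```

Using memcpy rather than a pointer cast sidesteps alignment and strict-aliasing concerns, and compilers typically turn it into ordinary loads and stores.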