Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?

Problem Description

Can a scaled 64bit/32bit division performed by the hardware 128bit/64bit division instruction, such as:

; Entry arguments: Dividend in EAX, Divisor in EBX
shl rax, 32  ;Scale up the Dividend by 2^32
xor rdx,rdx
and rbx, 0xFFFFFFFF  ;Clear any garbage that might have been in the upper half of RBX
div rbx  ; RAX = RDX:RAX / RBX

...be faster in some special cases than the scaled 64bit/32bit division performed by the hardware 64bit/32bit division instruction, such as:

; Entry arguments: Dividend in EAX, Divisor in EBX
mov edx,eax  ;Scale up the Dividend by 2^32
xor eax,eax
div ebx  ; EAX = EDX:EAX / EBX

By "some special cases" I mean unusual dividends and divisors.I am interested in comparing the div instruction only.

Solution

You're asking about optimizing uint64_t / uint64_t C division to a 64b / 32b => 32b x86 asm division, when the divisor is known to be 32-bit. The compiler must of course avoid the possibility of a #DE exception on a perfectly valid (in C) 64-bit division, otherwise it wouldn't be following the as-if rule. So it can only do this if it's provable that the quotient will fit in 32 bits.

Yes, that's a win or at least break-even. On some CPUs it's even worth checking for the possibility at runtime, because 64-bit division is so much slower. But unfortunately current x86 compilers don't have an optimizer pass to look for this optimization even when you do manage to give them enough info to prove it's safe. e.g. if (edx >= ebx) __builtin_unreachable(); didn't help last time I tried.
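For concreteness, here is the kind of hint that in my experience fails to trigger it (a minimal GNU C sketch; the function name is invented for illustration):

#include <stdint.h>

/* Sketch: (dividend >> 32) < divisor is exactly the condition under which
   a 64/32 => 32 div cannot #DE, so this promise would make div r32 safe.
   Current gcc/clang still emit a full 64-bit division here anyway. */
uint32_t div_hinted(uint64_t dividend, uint32_t divisor)
{
    if ((dividend >> 32) >= divisor)
        __builtin_unreachable();   /* promise: quotient fits in 32 bits */
    return (uint32_t)(dividend / divisor);
}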


For the same inputs, 32-bit operand-size will always be at least as fast

16 or 8-bit could maybe be slower than 32 because they may have a false dependency when writing their output, but writing a 32-bit register zero-extends to 64 to avoid that. (That's why mov ecx, ebx is a good way to zero-extend ebx to 64-bit: better than and with a mask that's not encodeable as a 32-bit sign-extended immediate, like harold pointed out.) But other than partial-register shenanigans, 16-bit and 8-bit division are generally also as fast as 32-bit, or at least not worse.
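That zero-extension rule is also why a uint32_t-to-uint64_t cast is free on x86-64; a trivial sketch (assuming GCC or clang):

#include <stdint.h>

/* Compiles to a single 32-bit register-to-register mov (e.g. mov eax, edi):
   writing the 32-bit register implicitly clears bits 63:32. */
uint64_t zero_extend(uint32_t low)
{
    return low;
}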

On AMD CPUs, division performance doesn't depend on operand-size, just the data. Even 0 / 1 with 128/64-bit operands should be faster than the worst case of any smaller operand-size. AMD's integer-division instruction is only 2 uops (presumably because it has to write 2 registers), with all the logic done in the execution unit.

16-bit / 8-bit => 8-bit division on Ryzen is a single uop (because it only has to write AH:AL = AX).


On Intel CPUs, div/idiv is microcoded as many uops. About the same number of uops for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much much slower. (Skylake div r64 is 36 uops, Skylake idiv r64 is 57 uops). See Agner Fog's instruction tables: https://agner.org/optimize/

div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.

See also Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux for a specific performance experiment where modifying the REX.W prefix in an existing binary to change div r64 into div r32 made a factor of ~3 difference in throughput.

And Why does Clang do this optimization trick only from Sandy Bridge onward? shows clang opportunistically using 32-bit division when the dividend is small, when tuning for Intel CPUs. But you have a large dividend and a large-enough divisor, which is a more complex case. That clang optimization is still zeroing the upper half of the dividend in asm, never using a non-zero or non-sign-extended EDX.
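The shape of that clang transformation, written out as C (a sketch of the idea, not clang's literal output; the function name is mine):

#include <stdint.h>

uint64_t div_opportunistic(uint64_t dividend, uint64_t divisor)
{
    /* If the high halves of both operands are zero, the quotient is the
       same either way and a cheap 32-bit div can't fault. */
    if (((dividend | divisor) >> 32) == 0)
        return (uint32_t)dividend / (uint32_t)divisor;   /* div r32 path */
    return dividend / divisor;                           /* div r64 path */
}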


I'm assuming you cast that 32-bit integer to uint64_t first, to avoid UB and get a normal uint64_t / uint64_t in the C abstract machine.

That makes sense: your way wouldn't be safe; it would fault with #DE whenever edx >= ebx. x86 division faults when the quotient overflows AL / AX / EAX / RAX, instead of silently truncating. There's no way to disable that.

So compilers normally only use idiv after cdq or cqo, and div only after zeroing the high half, unless you use an intrinsic or inline asm to open yourself up to the possibility of your code faulting. In C, x / y only faults if y = 0 (or for signed, INT_MIN / -1 is also allowed to fault [1]).
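If you want the exact sequence from the question, inline asm is how you take that risk on yourself. A minimal GNU C sketch (my naming; the caller must guarantee x < d, otherwise the div raises #DE):

#include <stdint.h>

static inline uint32_t scaled_div32(uint32_t x, uint32_t d)
{
    uint32_t quotient, remainder;
    /* div r/m32 computes EAX = EDX:EAX / d, with EDX = remainder.
       EDX = x, EAX = 0 sets up the dividend x * 2^32, matching the
       question's mov edx,eax / xor eax,eax sequence. */
    __asm__ ("divl %[d]"
             : "=a" (quotient), "=d" (remainder)
             : [d] "rm" (d), "a" (0), "d" (x));
    return quotient;
}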

GNU C doesn't have an intrinsic for wide division, but MSVC has _udiv64. (With gcc/clang, division wider than 1 register uses a helper function which does try to optimize for small inputs. But this doesn't help for 64/32 division on a 64-bit machine, where GCC and clang just use the 128/64-bit division instruction.)
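For MSVC, _udiv64 exposes the unguarded 64/32 => 32 hardware division directly, so the same #DE caveat applies. A sketch (Microsoft's docs declare it in <immintrin.h>):

#include <immintrin.h>

unsigned int scaled_div_msvc(unsigned int x, unsigned int d)
{
    unsigned int remainder;
    /* Quotient of (x * 2^32) / d as one div r32; faults if x >= d. */
    return _udiv64((unsigned __int64)x << 32, d, &remainder);
}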

Even if there were some way to promise the compiler that your divisor would be big enough to make the quotient fit in 32 bits, current gcc and clang don't look for that optimization in my experience. It would be a useful optimization for your case (if it's always safe), but compilers won't look for it.


Footnote 1: To be more specific, ISO C describes those cases as "undefined behaviour"; some ISAs like ARM have non-faulting division instructions. C UB means anything can happen, including just truncation to 0 or some other integer result. See Why does integer division by -1 (negative one) result in FPE? for an example of AArch64 vs. x86 code-gen and results. Allowed to fault doesn't mean required to fault.
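A two-line demonstration of that difference (a sketch; call it with the arguments shown in the comment):

/* quotient(INT_MIN, -1) is UB in C: x86 idiv raises #DE (delivered as
   SIGFPE on Linux), while AArch64's sdiv just returns INT_MIN without
   trapping. */
int quotient(int x, int y)
{
    return x / y;
}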
