Why is there no fused multiply-add for general-purpose registers on x86_64 CPUs?

Problem description

On Intel and AMD x86_64 processors, the SIMD vector registers have dedicated fused multiply-add (FMA) instructions, but the general-purpose (scalar, integer) registers don't: you basically need to multiply, then add (unless you can fit things into an lea).

Why is that? I mean, is it so useless that it isn't worth the overhead?
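As a rough illustration of the lea exception mentioned above: lea can compute base + index*scale + displacement, where the scale is limited to 1, 2, 4 or 8. The function names below are hypothetical, and whether a compiler actually emits a single lea for them is an assumption that depends on the compiler and optimization level.

```c
#include <stdint.h>

/* Fits the lea addressing form: base + index*4 + constant.
   Compilers typically turn this into one lea instead of imul + add. */
uint64_t scaled_add(uint64_t base, uint64_t index) {
    return base + index * 4 + 10;
}

/* x*5 is x + x*4, which also fits lea (base and index are the same register). */
uint64_t times5(uint64_t x) {
    return x * 5;
}

/* An arbitrary multiplier does not fit: this needs a multiply, then an add. */
uint64_t general_mul_add(uint64_t a, uint64_t b, uint64_t c) {
    return a * b + c;
}
```

Only the first two shapes are covered by lea; the general a*b + c case is the one the question is about.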

Solution

Integer multiply is common, but it's not one of the most common things to do with integers. With floating-point numbers, on the other hand, multiply and add are used all the time, and FMA provides major speedups for lots of ALU-bound FP code.

Also, floating point actually avoids precision loss with an FMA (the x*y internal temporary isn't rounded off at all before adding). This is why the ISO C99 / C++ fma() math library function exists, and why it's slow to implement without hardware FMA support.

Integer FMA (or multiply-accumulate, aka MAC) doesn't have any precision benefit vs. separate multiply and add.
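To see why, note that fixed-width unsigned integer arithmetic wraps modulo 2^n, so truncating after the multiply and again after the add gives the same low bits as truncating once at the end. A minimal sketch with hypothetical function names, using 32-bit operands so the exact result fits in 64 bits:

```c
#include <stdint.h>

/* Separate multiply and add, each wrapping modulo 2^32. */
uint32_t mul_add_separate(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t p = a * b;
    return p + c;
}

/* "Fused" model: compute the exact result in 64 bits, truncate once.
   (2^32-1)^2 + (2^32-1) still fits in a uint64_t, so nothing is lost
   before the final truncation. */
uint32_t mul_add_wide(uint32_t a, uint32_t b, uint32_t c) {
    uint64_t exact = (uint64_t)a * b + c;
    return (uint32_t)exact;
}
```

Both functions return the same value for every input, so a fused integer instruction could only save instructions or latency, not improve the result.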


Some non-x86 ISAs do provide integer FMA. It's not useless, but neither Intel nor AMD bothered to include it until AVX512-IFMA (and that's still SIMD-only, basically exposing the 52-bit mantissa multiplier circuits needed for double-precision FMA / vmulpd for use by integer instructions).

Non-x86 examples include:

  • MIPS32, madd / maddu (unsigned) to multiply-accumulate into the hi / lo registers (the special registers used as a destination by regular multiply and divide instructions).

  • ARM smlal and friends (32x32=>64 bit MAC, or 16x16=>32 bit), also available for unsigned integer. Operands are regular R0..R15 general purpose registers.
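In C terms, the widening multiply-accumulate these instructions provide looks roughly like the sketch below. The function names are hypothetical, and whether a given compiler maps them onto madd/maddu or smlal/umlal is an assumption that depends on the target and optimization settings.

```c
#include <stdint.h>

/* Signed 32x32 => 64-bit multiply, accumulated into a 64-bit value.
   Assumes the accumulator does not overflow (signed overflow is
   undefined behaviour in C, whereas the hardware simply wraps). */
int64_t mac_s32(int64_t acc, int32_t a, int32_t b) {
    return acc + (int64_t)a * (int64_t)b;
}

/* Unsigned variant; wraps modulo 2^64 like the hardware would. */
uint64_t mac_u32(uint64_t acc, uint32_t a, uint32_t b) {
    return acc + (uint64_t)a * (uint64_t)b;
}

/* Typical use: a dot product with a wide accumulator. */
int64_t dot_s32(const int32_t *x, const int32_t *y, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc = mac_s32(acc, x[i], y[i]);
    }
    return acc;
}
```

Each mac call has three inputs (the accumulator and the two factors), which is exactly the operand shape discussed for x86 in the next paragraphs.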


An integer register FMA would be useful on x86, but uops that have 3 integer inputs are rare. CMOV and ADC have 3 inputs, but one of those is flags. Even then, they didn't decode to a single uop on Intel until Broadwell, after 3-input uop support was added for FP FMA in Haswell.

Haswell and later can track fused-domain uops with 3 integer inputs, though, for (some) micro-fused instructions with indexed addressing modes. Sandybridge/Ivybridge un-laminate instructions like add eax, [rdx+rcx]. (But Nehalem could keep them micro-fused, like Haswell; SnB simplified the fused-domain uop format). Anyway, that's fused domain, not in the scheduler. Only Broadwell/Skylake can track 3-input integer uops in the scheduler, and that's only for 2 integer + flags, not 3 integer registers.

Intel does use a "unified" scheduler, where FP and integer ops use the same scheduler, and it can track proper 3-input FP FMA. So I don't know if there's a technical obstacle. If not, I don't know why Intel didn't include integer FMA as part of BMI2 or something, which added stuff like mulx (a 2-input, 2-output mul with mostly explicit operands, unlike legacy mul, which uses rdx:rax).


SSE2/SSSE3 do have integer multiply-add instructions for vector registers, but only with a horizontal add after a widening multiply: 16x16 => 32-bit (SSE2 pmaddwd) or (unsigned)8 x (signed)8 => 16-bit (SSSE3 pmaddubsw).

But those are only 2-input instructions, so even though there's a multiply and an add, it's very different from FMA.
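A scalar model of what pmaddwd computes may make the difference clearer. This is plain C emulating the instruction's arithmetic, not the real intrinsic (which is _mm_madd_epi16 in emmintrin.h), and the function name is made up for the example.

```c
#include <stdint.h>

/* Scalar model of SSE2 pmaddwd: 8 signed 16-bit lanes in each of two
   inputs produce 4 signed 32-bit lanes. Each output is the sum of two
   adjacent widened products. There is no third (accumulator) input. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t lo = (int32_t)a[2 * i] * b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Add as unsigned to get wrap-around instead of undefined behaviour
           in the one overflowing case (all four inputs equal to -32768).
           The conversion back to int32_t is implementation-defined in C
           and wraps on mainstream compilers. */
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}
```

The add here combines neighbouring products from the same two source vectors, so accumulating into a running total still needs a separate add instruction.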


Footnote: The question title originally said there was no FMA "for scalars". There is scalar FP FMA with the same FMA3 extension that added the packed versions of these: VFMADD231SD and friends operate on scalar double-precision, and the same flavours of vfmaddXXXss are available for scalar float in XMM registers.

05-31 13:07