How to use fused multiply-add (FMA) with SSE/AVX instructions


I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX; see
"FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2".


I would like to know how best to do this in code, and I also want to know how it is done internally in the CPU, given its superscalar architecture. Let's say I want to do a long sum such as the following in SSE:

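#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)

// sum = a1*b1 + a2*b2 + a3*b3 + ... where each a is a scalar (broadcast to all
// four lanes) and each b is a SIMD vector (e.g. from matrix multiplication).
// a and b are float arrays; b must be 16-byte aligned for _mm_load_ps.
__m128 sum = _mm_set1_ps(0.0f);
__m128 a1  = _mm_set1_ps(a[0]);
__m128 b1  = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));

__m128 a2  = _mm_set1_ps(a[1]);
__m128 b2  = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));

__m128 a3  = _mm_set1_ps(a[2]);
__m128 b3  = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));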

My question is how does this get converted to simultaneous multiply and add? Can the data be dependent? I mean can the CPU do _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously or do the registers used in the multiplication and add have to be independent?

Lastly, how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?



The compiler is not allowed to fuse a separate add and multiply unless you allow a relaxed floating-point model.

This is because an FMA rounds only once, while a separate MUL followed by an ADD rounds twice. The two can therefore give different results, so by fusing, the compiler would violate strict IEEE floating-point behavior.


Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.


So the best way to make sure you actually get the FMA instructions you want is to use the intrinsics provided for them:

FMA3 Intrinsics: (AVX2 - Intel Haswell)

  • _mm_fmadd_pd(), _mm256_fmadd_pd()
  • _mm_fmadd_ps(), _mm256_fmadd_ps()
  • and about a gazillion other variations...
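As an illustration, the sum from the question could be written with _mm_fmadd_ps roughly as follows. This is a sketch under several assumptions that are not in the original post: the helper names (weighted_sum, weighted_sum_fma, weighted_sum_sse) are hypothetical, the target("fma") attribute and __builtin_cpu_supports() are GCC/Clang extensions used here so the file builds without -mfma and falls back to separate multiply and add on CPUs without FMA3, and unaligned loads replace the question's _mm_load_ps so that the caller does not have to guarantee 16-byte alignment.

```c
#include <immintrin.h>

/* FMA3 path: each step is one fused multiply-add, sum = ai * bi + sum. */
__attribute__((target("fma")))
static void weighted_sum_fma(const float *a, const float *b, int n, float *out) {
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; i++) {
        __m128 ai = _mm_set1_ps(a[i]);         /* broadcast scalar a[i]  */
        __m128 bi = _mm_loadu_ps(&b[4 * i]);   /* load 4 floats of b     */
        sum = _mm_fmadd_ps(ai, bi, sum);       /* single rounding        */
    }
    _mm_storeu_ps(out, sum);
}

/* Fallback path: separate multiply and add, as in the question. */
static void weighted_sum_sse(const float *a, const float *b, int n, float *out) {
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; i++) {
        __m128 ai = _mm_set1_ps(a[i]);
        __m128 bi = _mm_loadu_ps(&b[4 * i]);
        sum = _mm_add_ps(sum, _mm_mul_ps(ai, bi));  /* two roundings */
    }
    _mm_storeu_ps(out, sum);
}

/* Computes out[0..3] = a[0]*b[0..3] + a[1]*b[4..7] + ... + a[n-1]*b[4n-4..4n-1],
   choosing the FMA path at run time when the CPU reports FMA3 support. */
void weighted_sum(const float *a, const float *b, int n, float *out) {
    if (__builtin_cpu_supports("fma")) {
        weighted_sum_fma(a, b, n, out);
    } else {
        weighted_sum_sse(a, b, n, out);
    }
}
```

Because the two paths round differently, their results can differ in the last bit for general inputs; they agree exactly only when every intermediate value is representable in a float.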

FMA4 Intrinsics: (XOP - AMD Bulldozer)

  • _mm_macc_pd(), _mm256_macc_pd()
  • _mm_macc_ps(), _mm256_macc_ps()
  • and about a gazillion other variations...
