How to use fused multiply-add (FMA) instructions with SSE/AVX


Problem description

I have learned that some Intel/AMD CPUs can do a simultaneous multiply and add with SSE/AVX; see "FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2".

I would like to know how best to do this in code, and I also want to know how it is done internally in the CPU, that is, with the superscalar architecture. Let's say I want to do a long sum such as the following in SSE:

//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1  = _mm_set1_ps(a[0]);
b1  = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));

a2  = _mm_set1_ps(a[1]);
b2  = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));

a3  = _mm_set1_ps(a[2]);
b3  = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...
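For reference, the unrolled pattern above can be written as a loop. This is only a minimal sketch under stated assumptions: the function name `weighted_row_sum` is hypothetical, `b` is assumed to hold `n` consecutive groups of 4 floats, and unaligned loads are used so the sketch does not depend on how `b` was allocated (the snippet above uses the aligned `_mm_load_ps`).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper (not from the original post):
 * out[0..3] = a[0]*b[0..3] + a[1]*b[4..7] + ... + a[n-1]*b[4n-4..4n-1] */
void weighted_row_sum(const float *a, const float *b, size_t n, float out[4])
{
    assert(a != NULL && b != NULL && out != NULL);

    __m128 sum = _mm_setzero_ps();               /* running vector sum */
    for (size_t i = 0; i < n; i++) {
        __m128 ai = _mm_set1_ps(a[i]);           /* broadcast scalar a[i] */
        __m128 bi = _mm_loadu_ps(&b[4 * i]);     /* load 4 floats of b */
        sum = _mm_add_ps(sum, _mm_mul_ps(ai, bi)); /* separate mul, then add */
    }
    _mm_storeu_ps(out, sum);
}
```

Note that every iteration reads and writes the same `sum` register, so each add depends on the previous one. That dependency chain is exactly what the question below asks about.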

My question is: how does this get converted to a simultaneous multiply and add? Can the data be dependent? I mean, can the CPU do _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously, or do the registers used in the multiplication and the addition have to be independent?

Lastly, how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?
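Whether a compiler fuses the separate multiply and add is compiler- and flag-dependent, and nothing in this post settles it. One way to request a fused operation explicitly is the `_mm_fmadd_ps` intrinsic from `<immintrin.h>`. The sketch below assumes GCC or Clang, which define `__FMA__` when FMA code generation is enabled (for example with `-mfma`); the wrapper name `mul_add_ps` and the self-check function are hypothetical. Without FMA enabled it falls back to the two-instruction form from the question.

```c
#include <assert.h>
#include <immintrin.h>  /* SSE/AVX/FMA intrinsics */

/* Hypothetical wrapper: returns a*b + c per lane. */
static inline __m128 mul_add_ps(__m128 a, __m128 b, __m128 c)
{
#if defined(__FMA__)
    /* Fused: one instruction, one rounding step. */
    return _mm_fmadd_ps(a, b, c);
#else
    /* Unfused fallback: multiply rounds, then add rounds. */
    return _mm_add_ps(_mm_mul_ps(a, b), c);
#endif
}

/* Tiny self-check with values that are exact in float on either path. */
static inline void mul_add_selfcheck(void)
{
    __m128 r = mul_add_ps(_mm_set1_ps(2.0f), _mm_set1_ps(3.0f),
                          _mm_set1_ps(1.0f));
    assert(_mm_cvtss_f32(r) == 7.0f);  /* 2*3 + 1 in lane 0 */
}
```

With such a wrapper, the accumulation line from the question would read `sum = mul_add_ps(a1, b1, sum);`. Because the fused form rounds once instead of twice, results can differ in the last bit from the unfused form for inputs that are not exactly representable.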


Recommended answer