Question
I have learned that some Intel/AMD CPUs can do a simultaneous multiply and add with SSE/AVX:
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
I would like to know how best to do this in code, and I also want to know how it is done internally in the CPU, that is, in terms of the superscalar architecture. Say I want to compute a long sum such as the following in SSE:
//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1 = _mm_set1_ps(a[0]);
b1 = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
a2 = _mm_set1_ps(a[1]);
b2 = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));
a3 = _mm_set1_ps(a[2]);
b3 = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...
My question is: how does this get converted to a simultaneous multiply and add? Can the data be dependent? That is, can the CPU do _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously, or do the registers used in the multiplication and the add have to be independent?
Lastly, how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?
Recommended answer
The compiler is not allowed to fuse a separate add and multiply unless you allow a relaxed floating-point model.
This is because an FMA rounds only once, while a separate MUL followed by an ADD rounds twice. Fusing the two can therefore change the result, so the compiler would violate strict IEEE floating-point behavior by doing so.
Even if you enable relaxed floating-point, the compiler might still choose not to fuse, since it might expect you to know what you're doing if you're already using intrinsics.
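For plain scalar expressions (not intrinsics), standard C has a pragma that permits, but does not require, this kind of contraction. The sketch below assumes a compiler that honors the pragma. Support varies, and some compilers ignore it and control contraction through a command-line option instead. The function name `mul_add` is made up for this illustration.

```c
/* Permit (but do not require) the compiler to contract a*b + c
   into a single fused multiply-add. Compilers that do not implement
   this pragma may ignore it or warn about it. */
#pragma STDC FP_CONTRACT ON

double mul_add(double a, double b, double c)
{
    return a * b + c;  /* may or may not become one FMA instruction */
}
```

Whether an FMA is actually emitted also depends on the target options, so the only way to be sure is to inspect the generated assembly.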
So the best way to make sure you actually get the FMA instructions you want is to use the intrinsics provided for them:
FMA3 Intrinsics: (AVX2 - Intel Haswell)
- _mm_fmadd_pd(), _mm256_fmadd_pd()
- _mm_fmadd_ps(), _mm256_fmadd_ps()
- and about a gazillion other variations...
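As a rough sketch of how the sum from the question might look with the FMA3 intrinsics, the code below accumulates a[i] * b[4*i .. 4*i+3] using `_mm_fmadd_ps`. The function names, the use of unaligned loads, and the runtime fallback to separate multiply and add are assumptions made for this illustration and are not part of the original answer. `__attribute__((target("fma")))` and `__builtin_cpu_supports` are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stddef.h>

/* FMA path. The target attribute enables FMA code generation for this
   function only, so the file builds without a global -mfma flag. */
__attribute__((target("fma")))
static void scaled_sum_fma(const float *a, const float *b, size_t n, float *out)
{
    __m128 sum = _mm_setzero_ps();
    for (size_t i = 0; i < n; ++i) {
        __m128 ai = _mm_set1_ps(a[i]);        /* broadcast the scalar a[i] */
        __m128 bi = _mm_loadu_ps(&b[4 * i]);  /* load four floats of b */
        sum = _mm_fmadd_ps(ai, bi, sum);      /* sum = ai * bi + sum, one rounding */
    }
    _mm_storeu_ps(out, sum);
}

/* Fallback path: separate multiply and add, as in the question. */
static void scaled_sum_sse(const float *a, const float *b, size_t n, float *out)
{
    __m128 sum = _mm_setzero_ps();
    for (size_t i = 0; i < n; ++i) {
        __m128 ai = _mm_set1_ps(a[i]);
        __m128 bi = _mm_loadu_ps(&b[4 * i]);
        sum = _mm_add_ps(sum, _mm_mul_ps(ai, bi));  /* two roundings */
    }
    _mm_storeu_ps(out, sum);
}

/* Uses the FMA path only when the CPU running the code supports it.
   b must hold 4 * n floats; out receives 4 floats. */
void scaled_sum(const float *a, const float *b, size_t n, float *out)
{
    if (__builtin_cpu_supports("fma"))
        scaled_sum_fma(a, b, n, out);
    else
        scaled_sum_sse(a, b, n, out);
}
```

Because the two paths round differently, they can give slightly different results for inputs whose products are not exactly representable.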
FMA4 Intrinsics: (XOP - AMD Bulldozer)
- _mm_macc_pd(), _mm256_macc_pd()
- _mm_macc_ps(), _mm256_macc_ps()
- and about a gazillion other variations...