FMA3 in GCC: how to enable it

Problem description
I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code that I wrote, with GCC 4.8.1 on Linux. Below is a list of three different ways I compile it.
SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
AVX: gcc matrix.cpp -o matrix_gcc -O3 -mavx -fopenmp
AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math
The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't understand this. I get over 80% of the peak flops of the CPU assuming there is no FMA, but I think I should be able to do a lot better with FMA. Matrix multiplication should benefit directly from FMA. I'm essentially doing eight dot products at once in AVX. When I check -march=native it gives:
cc -march=native -E -v - </dev/null 2>&1 | grep cc1 | grep fma
...-march=core-avx2 -mavx -mavx2 -mfma -mno-fma4 -msse4.2 -msse4.1 ...
So I can see it's enabled (just to be sure, I added -mfma, but it makes no difference). -ffast-math should allow a relaxed floating-point model. Related: How to use Fused Multiply-Add (FMA) instructions with SSE/AVX
Edit:
Based on Mysticial's comments I went ahead and used _mm256_fmadd_ps, and now the AVX2+FMA version is faster. I'm not sure why the compiler won't do this for me. I'm now getting about 80 GFLOPS (110% of the peak flops without FMA) for matrices larger than 1000x1000. In case anyone does not trust my peak flop calculation, here is what I did.
peak flops (no FMA) = frequency * simd_width * ILP * cores
= 2.3 GHz * 8 * 2 * 2 = 73.6 GFLOPS
peak flops (with FMA) = 2 * peak flops (no FMA) = 147.2 GFLOPS
My CPU in turbo mode when using both cores is 2.3 GHz. I get 2 for ILP because Ivy Bridge can do one AVX multiplication and one AVX addition at the same time (and I have unrolled the loop several times to ensure this).
I'm only getting about 55% of the peak flops (with FMA). I'm not sure why, but at least I'm seeing something now.
One side effect is that I now get a small error when I compare against a simple matrix multiplication algorithm I know I trust. I think that's because FMA performs only one rounding instead of the usual two (which, ironically, breaks strict IEEE floating-point equivalence with the unfused code even though the fused result is probably more accurate).
Edit:
Somebody needs to redo "How do I achieve the theoretical maximum of 4 FLOPs per cycle?" but doing 8 double floating-point FLOPs per cycle with Haswell.
Edit:
Actually, Mysticial has updated his project to support FMA3 (see his answer in the link above). I ran his code on Windows 8 with MSVC 2012 (because the Linux version did not compile with FMA support). Here are the results:
Testing AVX Mul + Add:
Seconds = 22.7417
FP Ops = 768000000000
FLOPs = 3.37705e+010
sum = 17.8122
Testing FMA3 FMA:
Seconds = 22.1389
FP Ops = 1536000000000
FLOPs = 6.938e+010
sum = 333.309
That's 69.38 GFLOPS for FMA3 in double precision. For single precision I need to double it, so that's 138.76 SP GFLOPS. I calculate my peak as 147.2 SP GFLOPS. That's about 94% of peak! In other words, I should be able to improve my GEMM code quite a bit (although it's already significantly faster than Eigen).
Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0, breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5: _mm256_add_ps, for instance, is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.