Problem description
I'm fairly new to SIMD and wanted to see whether I could get GCC to vectorise a simple operation for me.
So I looked at this post and wanted to do more or less the same thing (but with gcc 5.4.0 on 64-bit Linux, for a Kaby Lake processor).
I basically have this function:
/* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < P; j++)
        {
            double tmp = 0.0;
            for (k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];
            m3[i][j] = tmp; /* store the accumulated dot product */
        }
}
I compile it with -O2 -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5, but I don't see any message saying that the vectorization was done.
If anyone could help me out, that would be very much appreciated.
Recommended answer
Your command does not produce any vectorization messages. You can use -fopt-info-vec to turn the vectorization report on. Do not rely on the report alone, though: compilers sometimes mislead (they vectorize a loop and report it, but the vectorized code is not actually used), so check the actual improvement. To do that, measure the speedup. First disable vectorization and measure the time t1, then enable it and measure the time t2. The speedup is t1/t2: greater than 1 means the compiler improved the code, 1 means no improvement, and less than 1 means the auto-vectorizer made it slower. Alternatively, add -S to your command and inspect the assembly in the generated .s file.
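A minimal sketch of the t1/t2 measurement described above, assuming a POSIX system that provides clock_gettime. The names seconds_per_call, fill_inputs and kernel and the size DIM are hypothetical and chosen only for illustration; they do not come from the question.

```c
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime under strict -std modes */
#include <assert.h>
#include <time.h>

#define DIM 64 /* illustrative matrix size, an assumption */

static double a[DIM][DIM], b[DIM][DIM], c[DIM][DIM];

/* Fill the inputs with values whose products and sums are exact in double. */
void fill_inputs(void)
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            a[i][j] = 1.0;
            b[i][j] = 2.0;
        }
}

/* Code under test: c = a * b, same loop nest as the question. */
void kernel(void)
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double tmp = 0.0;
            for (int k = 0; k < DIM; k++)
                tmp += a[i][k] * b[k][j];
            c[i][j] = tmp;
        }
}

/* Average wall-clock seconds per call of fn, taken over reps calls. */
double seconds_per_call(void (*fn)(void), int reps)
{
    struct timespec t0, t1;
    assert(reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
}
```

One way to use it: call fill_inputs() once and print seconds_per_call(kernel, 100) from main, build the program once with -O2 -fno-tree-vectorize to obtain t1 and once with -O2 -ftree-vectorize -march=native to obtain t2, then compare t1/t2. Timings on a small matrix are noisy, so several runs are usually needed before the ratio means much.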
NOTE: if you want to see the full power of auto-vectorization, add -march=native and remove -msse2.
UPDATE: When you use variables such as N, M, etc. as the loop bounds, you might not see vectorization, so you should use constants instead. In my experience, matrix-matrix multiplication is vectorizable with gcc 4.8, 5.4 and 6.2. Other compilers such as clang-LLVM, ICC and MSVC vectorize it as well. As mentioned in the comments, if you use the double or float datatypes you might need -ffast-math, which is enabled at the -Ofast optimization level, to tell the compiler that you don't need a strictly accurate result (that is OK most of the time). This is because compilers are more careful about floating-point operations: vectorizing the sum changes the order of the additions, which can change the rounding.
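The advice about constant bounds could look roughly like the sketch below. The sizes 64, the use of contiguous 2-D arrays instead of double** rows, and the names mmul_fixed and self_check are assumptions made for illustration. Whether GCC actually vectorizes this loop nest depends on the compiler version and flags, so it is worth checking with something like -O2 -ftree-vectorize -march=native -ffast-math -fopt-info-vec.

```c
#include <assert.h>

/* Compile-time sizes; the names follow the question, the values are assumptions. */
enum { N = 64, M = 64, P = 64 };

/* Same i-j-k loop nest as the question, but with constant bounds and
   contiguous arrays, so the trip counts and strides are known to the compiler. */
void mmul_fixed(double m1[N][M], double m2[M][P], double m3[N][P])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++) {
            double tmp = 0.0;
            /* This inner loop is a floating-point reduction: vectorizing it
               reorders the additions, which is what -ffast-math permits. */
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];
            m3[i][j] = tmp;
        }
}

/* Quick correctness check to run after changing flags. */
void self_check(void)
{
    static double a[N][M], b[M][P], c[N][P];
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = 1.0;
    for (int k = 0; k < M; k++)
        for (int j = 0; j < P; j++)
            b[k][j] = 2.0;
    mmul_fixed(a, b, c);
    /* Every entry is M * 1.0 * 2.0, which is exact in double regardless of
       the order in which the terms are summed. */
    assert(c[0][0] == 2.0 * M);
    assert(c[N - 1][P - 1] == 2.0 * M);
}
```

The self-check uses inputs whose sums are exact, so it passes with or without -ffast-math; with arbitrary inputs, results from a reordered sum may differ in the last bits and would need a tolerance instead of ==.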