Auto-vectorization of matrix multiplication


Problem description

I'm fairly new to SIMD and wanted to see whether I could get GCC to vectorise a simple operation for me.

So I looked at this post and wanted to do more or less the same thing (but with gcc 5.4.0 on 64-bit Linux, for a Kaby Lake processor).

I basically have this function:

/* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++)
        {
            double tmp = 0.0;

            /* dot product of row i of m1 and column j of m2 */
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];

            m3[i][j] = tmp;
        }
}

I compile it with -O2 -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5, but I don't see any message saying that vectorization was done.

If anyone could help me out, that would be very much appreciated.

Recommended answer

Nothing in your command produces a vectorization message. You can turn the vectorization report on with -fopt-info-vec. But do not rely on it alone: the compiler sometimes lies (it vectorizes a loop and reports it, but the vectorized version is not the one that gets used), so check the actual improvement. To do that, measure the speedup. First disable vectorization and measure the time t1, then enable it and measure the time t2. The speedup is t1/t2: greater than 1 means the compiler improved the code, equal to 1 means no improvement, and less than 1 means the auto-vectorizer made things worse for you. Another way is to add -S to your command and inspect the assembly code in the separate .s file it generates.
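The t1/t2 measurement described above could be sketched roughly as follows. This is a minimal sketch, not a rigorous benchmark: the helpers alloc_matrix, free_matrix and time_mmul are hypothetical names introduced here, clock() measures CPU time rather than wall time, and the matrix size and repetition count are left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Same kernel as in the question: m3 = m1 * m2 (m1 is N x M, m2 is M x P). */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++) {
            double tmp = 0.0;
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];
            m3[i][j] = tmp;
        }
}

/* Hypothetical helper: allocate a rows x cols matrix filled with `value`. */
double **alloc_matrix(int rows, int cols, double value)
{
    double **m = malloc((size_t)rows * sizeof *m);
    assert(m != NULL);
    for (int i = 0; i < rows; i++) {
        m[i] = malloc((size_t)cols * sizeof **m);
        assert(m[i] != NULL);
        for (int j = 0; j < cols; j++)
            m[i][j] = value;
    }
    return m;
}

/* Hypothetical helper: release a matrix created by alloc_matrix. */
void free_matrix(double **m, int rows)
{
    for (int i = 0; i < rows; i++)
        free(m[i]);
    free(m);
}

/* Hypothetical helper: average CPU seconds per mmul call on n x n matrices.
 * One result element is written to *checksum so the work is observable
 * and cannot be treated as dead code. */
double time_mmul(int n, int reps, double *checksum)
{
    assert(n > 0 && reps > 0 && checksum != NULL);
    double **a = alloc_matrix(n, n, 1.0);
    double **b = alloc_matrix(n, n, 2.0);
    double **c = alloc_matrix(n, n, 0.0);

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        mmul(a, b, c, n, n, n);
    clock_t end = clock();

    *checksum = c[0][0];
    free_matrix(a, n);
    free_matrix(b, n);
    free_matrix(c, n);
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

One way to use it: call time_mmul from a small main, build the program once with -O2 -fno-tree-vectorize to get t1 and once with -O2 -ftree-vectorize -march=native to get t2, then compare t1/t2. Checking that the checksum is identical in both builds is a cheap sanity test that the two binaries compute the same thing.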

NOTE: if you want to see the full power of auto-vectorization, add -march=native and remove -msse2.

UPDATE: When you use variables such as N, M, etc. as the loop bounds, you might not see vectorization, so you should use constants instead. In my experience, matrix-matrix multiplication is vectorizable with gcc 4.8, 5.4 and 6.2. Other compilers such as Clang/LLVM, ICC and MSVC vectorize it as well. As mentioned in the comments, if you use double or float data types you might need -ffast-math, which is one of the flags enabled by the -Ofast optimization level, to say that you don't need a high-accuracy result (that is OK most of the time). This is because compilers are more careful about floating-point operations: vectorizing a sum reorders the additions, which can change the rounded result, so they will not do it unless you allow it.
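As an illustration of the "use constants" advice, here is a minimal sketch of the same kernel with compile-time dimensions. The sizes (64), the name mmul_fixed and the demo_fixed helper are assumptions made for this example, and the switch from double ** to true 2D arrays is also a choice made here so that the bounds are part of the type; the answer itself only says to use constants.

```c
#include <assert.h>

/* Assumed sizes, chosen only for illustration. */
enum { N = 64, M = 64, P = 64 };

/* Same i-j-k loop nest as in the question, but every trip count is a
 * compile-time constant and the matrices are contiguous 2D arrays. */
void mmul_fixed(double m1[N][M], double m2[M][P], double m3[N][P])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++) {
            double tmp = 0.0;
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];
            m3[i][j] = tmp;
        }
}

/* Hypothetical self-check: fill the inputs, multiply, return one element. */
double demo_fixed(void)
{
    static double a[N][M], b[M][P], c[N][P];

    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = 1.0;
    for (int k = 0; k < M; k++)
        for (int j = 0; j < P; j++)
            b[k][j] = 0.5;

    mmul_fixed(a, b, c);

    /* Every element is a sum of M products of 1.0 * 0.5. */
    assert(c[N - 1][P - 1] == 0.5 * M);
    return c[0][0];
}
```

Whether GCC actually vectorizes this still depends on the version and flags, so it is worth compiling with -fopt-info-vec and, for the double reduction over k, trying both with and without -ffast-math before drawing conclusions.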
