本文介绍了ICC中的-O2弄乱了汇编程序,ICC中的-O1和GCC / Clang中的所有优化都很好的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近开始使用ICC(18.0.1.126)来编译可以在任意优化设置下与GCC和Clang一起正常工作的代码。该代码包含一个汇编程序例程,该例程使用AVX2和FMA指令将4x4的double矩阵相乘。经过多番摆弄之后,事实证明,使用-O1-xcore-avx2进行编译时,汇编程序正常运行,但是使用-O2-xcore-avx2进行编译时,给出了错误的数值结果。但是,该代码可以在所有优化设置上编译,而不会出现任何错误消息。它运行于2015年初的采用Broadwell核心i5的MacBook Air上。并使用汇编器/内部函数。他们所有人都是同一个问题。



我将指向4x4双精度数组的第一个元素的指针传递给例程,该指针被创建为
double MatrixDummy [4] [4];
并以
(& MatrixDummy)[0] [0]



的形式传递给例程:

  // Routine乘以4x4矩阵A * B并将结果存储在C 
内联void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double * A,double * B,双* C)
{
__asm__ __volatile__( vmovupd%0,%% ymm0 \n\t
vmovupd%1,%% ymm1 \n\ t
vmovupd%2,%% ymm2 \n\t
vmovupd%3,%% ymm3


m (B [0]),
m(B [4]),
m(B [8]),
m(B [12])

ymm0, ymm1, ymm2, ymm3);


__asm__ __volatile__( vbroadcastsd%1,%% ymm4 \nt
vbroadcastsd%2,%% ymm5 \n\t
vbroadcastsd%3,%% ymm6 \n\t
vbroadcastsd%4,%% ymm7 \n\t
vmulpd %% ymm4,%% ymm0 ,%% ymm8 \n\t
vfmadd231PD %% ymm5,%% ymm1,%% ymm8 \n\t
vfmadd231PD %% ymm6,%% ymm2,% %ymm8 \n\t
vfmadd231PD %% ymm7,%% ymm3,%% ymm8 \n\t
vmovupd %% ymm8,%0

= m(C [0])

m(A [0]),
m(A [1]),
m(A [2]),
m(A [3])

ymm4, ymm5, ymm6, ymm7, ymm8);

__asm__ __volatile__( vbroadcastsd%1,%% ymm4 \n\t
vbroadcastsd%2,%% ymm5 \n\t
vbroadcastsd%3,%% ymm6 \n\t
vbroadcastsd%4,%% ymm7 \n\t
vmulpd %% ymm4,%% ymm0,%% ymm8 \n\t
vfmadd231PD %% ymm5,%% ymm1,%% ymm8 \n\t
vfmadd231PD %% ymm6,%% ymm2,%% ymm8 \ n\t
vfmadd231PD %% ymm7,%% ymm3,%% ymm8 \n\t
vmovupd %% ymm8,%0

= m(C [4])

m(A [4]),
m(A [5]),
m (A [6]),
m(A [7])

ymm4, ymm5, ymm6, ymm7, ymm8) ;

__asm__ __volatile__( vbroadcastsd%1,%% ymm4 \n\t
vbroadcastsd%2,%% ymm5 \n\t
vbroadcastsd%3,%% ymm6 \n\t
vbroadcastsd%4,%% ymm7 \n\t
vmulpd %% ymm4,%% ymm0,%% ymm8 \n\t
vfmadd231PD %% ymm5,%% ymm1,%% ymm8 \n\t
vfmadd231PD %% ymm6,%% ymm2,%% ymm8 \ n\t
vfmadd231PD %% ymm7,%% ymm3,%% ymm8 \n\t
vmovupd %% ymm8,%0

= m(C [8])

m(A [8]),
m(A [9]),
m (A [10]),
m(A [11])$ ​​b $ b:
ymm4, ymm5, ymm6, ymm7, ymm8) ;


__asm__ __volatile__( vbroadcastsd%1,%% ymm4 \nt
vbroadcastsd%2,%% ymm5 \n\t
vbroadcastsd%3,%% ymm6 \n\t
vbroadcastsd%4,%% ymm7 \n\t
vmulpd %% ymm4,%% ymm0 ,%% ymm8 \n\t
vfmadd231PD %% ymm5,%% ymm1,%% ymm8 \n\t
vfmadd231PD %% ymm6,%% ymm2,% %ymm8 \n\t
vfmadd231PD %% ymm7,%% ymm3,%% ymm8 \n\t
vmovupd %% ymm8,%0

= m(C [12])

m(A [12]),
m(A [13]),
m(A [14]),
m(A [15])

ymm4, ymm5, ymm6, ymm7, ymm8);

}

作为比较,下面的代码应该完全一样,并且使用所有编译器/优化设置来执行此操作。由于如果我使用此例程而不是汇编程序,则一切正常,因此我希望错误必须出在ICC如何使用-O2优化处理汇编程序例程上。

 行内无效Run3ForLoops_MultiplyMatrixByMatrix_OutputTo3(double * A,double * B,double * C){

int i,j,k;

double dummy [4] [4];

for(j = 0; j< 4; j ++){
for(k = 0; k dummy [j] [k] = 0.0;
for(i = 0; I< 4; i ++){
dummy [j] [k] + = *(A + j * 4 + i)*(*(B + i * 4 + k));
}
}
}

for(j = 0; j< 4; j ++){
for(k = 0; k< 4; k ++){
*(C + j * 4 + k)=虚拟[j] [k];
}
}


}

有什么想法吗?我真的很困惑。

解决方案

代码的核心问题是假设,如果您向寄存器中写入一个值,那么该值仍将在下一条语句中。这个假设是错误的。在 asm 语句之间,编译器可以根据需要使用任何寄存器。例如,它可能决定使用 ymm0 在您的语句之间将变量从一个位置复制到另一位置,从而破坏其先前的内容。



进行内联汇编的正确方法是,如果没有充分的理由,则永远不要直接引用寄存器。您要在汇编语句之间保留的每个值都需要使用适当的操作数放置在变量中。 对此非常清楚。



例如,让我重写您的代码以使用正确的内联汇编:

  #include < immintrin.h> 

内联void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double * A,double * B,double * C)
{
size_t i;

/ *您使用的寄存器* /
__m256 a0,a1,a2,a3,b0,b1,b2,b3,sum;
__m256 * B256 =(__m256 *)B,* C256 =(__m256 *)C;

/ *从B * /
asm中加载值( vmovupd%1,%0: = x(b0): m(B256 [0]));
asm( vmovupd%1,%0: = x(b1): m(B256 [1]));
asm( vmovupd%1,%0: = x(b2): m(B256 [2]));
asm( vmovupd%1,%0: = x(b3): m(B256 [3]));

for(i = 0; i< 4; i ++){
/ *从A * /
asm加载值( vbroadcastsd%1,%0: = x(a0): m(A [4 * i + 0]));
asm( vbroadcastsd%1,%0: = x(a1): m(A [4 * i + 1]));
asm( vbroadcastsd%1,%0: = x(a2): m(A [4 * i + 2]));
asm( vbroadcastsd%1,%0: = x(a3): m(A [4 * i + 3]));

asm( vmulpd%2,%1,%0: = x(sum): x(a0), x(b0));
asm( vfmadd231pd%2,%1,%0: + x(sum): x(a1), x(b1));
asm( vfmadd231pd%2,%1,%0: + x(sum): x(a2), x(b2));
asm( vfmadd231pd%2,%1,%0: + x(sum): x(a3), x(b3));
asm( vmovupd%1,%0: = m(C256 [i]): x(sum));
}
}

有很多事情您应该立即注意:




  • 我们使用的每个寄存器都是通过asm操作数来抽象描述的。

  • 我们保存的所有值都是绑定的到局部变量,以便编译器可以跟踪正在使用的寄存器以及哪些寄存器可能被破坏

  • 因为asm语句的所有依赖关系和副作用都是通过操作数明确描述的,所以没有 volatile 限定符是必需的,编译器可以更好地优化代码



您应该真正考虑使用内在函数,因为与内联汇编相比,编译器可以对内在函数进行更多的优化。这是因为编译器在某种程度上了解内在函数的作用,并且可以使用此知识来生成更好的代码。


I was recently starting to use ICC (18.0.1.126) to compile a code that worked fine with GCC and Clang on arbitrary optimization settings. The code contains an assembler routine that multiplies 4x4 matrices of doubles using AVX2 and FMA instructions. After much fiddling it turned out that the assembler routine is working properly when compiled with -O1 - xcore-avx2, but gives a wrong numerical result when compiled with -O2 - xcore-avx2. The code compiles however without any error messages on all optimization settings. It runs on an early 2015 MacBook Air with Broadwell core i5.

I also have several versions of the 4x4 matrix multiplication routine originally written for speed testing, with / without FMA, and using assembler / intrinsics. It is the same problem for all of them.

I pass to the routine a pointer to the first element of a 4x4 array of doubles which was created as double MatrixDummy[4][4];and passed to the routine as (&MatrixDummy)[0][0]

The assembler routine is here:

//Routine multiplies the 4x4 matrices A * B and store the result in C
inline void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
__asm__ __volatile__  ("vmovupd %0, %%ymm0 \n\t"
                       "vmovupd %1, %%ymm1 \n\t"
                       "vmovupd %2, %%ymm2 \n\t"
                       "vmovupd %3, %%ymm3"
                       :
                       :
                       "m" (B[0]),
                       "m" (B[4]),
                       "m" (B[8]),
                       "m" (B[12])
                       :
                       "ymm0", "ymm1", "ymm2", "ymm3");


__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[0])
                      :
                      "m" (A[0]),
                      "m" (A[1]),
                      "m" (A[2]),
                      "m" (A[3])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");

__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[4])
                      :
                      "m" (A[4]),
                      "m" (A[5]),
                      "m" (A[6]),
                      "m" (A[7])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");

__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[8])
                      :
                      "m" (A[8]),
                      "m" (A[9]),
                      "m" (A[10]),
                      "m" (A[11])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");


__asm__ __volatile__ ("vbroadcastsd %1, %%ymm4 \n\t"
                      "vbroadcastsd %2, %%ymm5 \n\t"
                      "vbroadcastsd %3, %%ymm6 \n\t"
                      "vbroadcastsd %4, %%ymm7 \n\t"
                      "vmulpd %%ymm4, %%ymm0, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm5, %%ymm1, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm6, %%ymm2, %%ymm8 \n\t"
                      "vfmadd231PD %%ymm7, %%ymm3, %%ymm8 \n\t"
                      "vmovupd %%ymm8, %0"
                      :
                      "=m" (C[12])
                      :
                      "m" (A[12]),
                      "m" (A[13]),
                      "m" (A[14]),
                      "m" (A[15])
                      :
                      "ymm4", "ymm5", "ymm6", "ymm7", "ymm8");

}

As a comparison, the following code is supposed to do the exact same thing, and does so using all compilers / optimization settings. Since everything works if I use this routine instead of the assembler one, I would expect that the error has to be in how ICC handles the assembler routine with -O2 optimization.

inline void Run3ForLoops_MultiplyMatrixByMatrix_OutputTo3(double *A, double *B, double *C){

int i, j, k;

double dummy[4][4];

for(j=0; j<4; j++) {
    for(k=0; k<4; k++) {
        dummy[j][k] = 0.0;
        for(i=0; I<4; i++) {
            dummy[j][k] += *(A+j*4+i)*(*(B+i*4+k));
        }
    }
}

for(j=0; j<4; j++) {
    for(k=0; k<4; k++) {
        *(C+j*4+k) = dummy[j][k];
    }
}


}

Any ideas? I am really confused.

解决方案

The core problem with your code is the assumption that if you write a value to a register, that value is still going to be there in the next statement. This assumption is wrong. Between asm statements, the compiler may use any register as it likes. For example, it may decide to use ymm0 to copy a variable from one location to another between your statements, trashing its previous content.

The correct way to do inline assembly is to never ever refer to registers directly if there is no good reason to do so. Each value you want to keep between assembly statements needs to be placed in a variable using an appropriate operand. The manual is quite clear about this.

As an example, let me rewrite your code to use correct inline assembly:

#include <immintrin.h>

inline void RunAssembler_FMA_UnalignedCopy_MultiplyMatrixByMatrix(double *A, double *B, double *C)
{
    size_t i;

    /* the registers you use */
    __m256 a0, a1, a2, a3, b0, b1, b2, b3, sum;
    __m256 *B256 = (__m256 *)B, *C256 = (__m256 *)C;

    /* load values from B */
    asm ("vmovupd %1, %0" : "=x"(b0) : "m"(B256[0]));
    asm ("vmovupd %1, %0" : "=x"(b1) : "m"(B256[1]));
    asm ("vmovupd %1, %0" : "=x"(b2) : "m"(B256[2]));
    asm ("vmovupd %1, %0" : "=x"(b3) : "m"(B256[3]));

    for (i = 0; i < 4; i++) {
        /* load values from A */
        asm ("vbroadcastsd %1, %0" : "=x"(a0) : "m"(A[4 * i + 0]));
        asm ("vbroadcastsd %1, %0" : "=x"(a1) : "m"(A[4 * i + 1]));
        asm ("vbroadcastsd %1, %0" : "=x"(a2) : "m"(A[4 * i + 2]));
        asm ("vbroadcastsd %1, %0" : "=x"(a3) : "m"(A[4 * i + 3]));

        asm ("vmulpd %2, %1, %0"      : "=x"(sum) : "x"(a0), "x"(b0));
        asm ("vfmadd231pd %2, %1, %0" : "+x"(sum) : "x"(a1), "x"(b1));
        asm ("vfmadd231pd %2, %1, %0" : "+x"(sum) : "x"(a2), "x"(b2));
        asm ("vfmadd231pd %2, %1, %0" : "+x"(sum) : "x"(a3), "x"(b3));
        asm ("vmovupd %1, %0" : "=m"(C256[i]) : "x"(sum));
    }
}

There are a bunch of things you should immediately notice:

  • every register we use is described abstractly through an asm operand
  • all values we save are tied to local variables so the compiler can keep track of which registers are in use and which registers can be clobbered
  • since all dependencies and side effects of the asm statements are explicitly described through operands, no volatile qualifier is necessary and the compiler can optimize the code much better

Still, you should really consider using intrinsics instead as the compilers can do many more optimizations with intrinsics than they can with inline assembly. This is because the compiler to some extent understands what the intrinsic does and can use this knowledge to generate better code.

这篇关于ICC中的-O2弄乱了汇编程序,ICC中的-O1和GCC / Clang中的所有优化都很好的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-06 19:37