This article looks at the question: what is the benefit of using vaddss instead of addss in scalar matrix addition?

Problem description

I have implemented scalar matrix addition kernel.

#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>

//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000


float   __attribute__(( aligned(32))) A[N][M],
        __attribute__(( aligned(32))) B[N][M],
        __attribute__(( aligned(32))) C[N][M];

int main()
{
int w=0, i, j;
struct timespec tStart, tEnd;//used to record the processing time
double tTotal , tBest=10000;//the minimum total time will be assigned to the best time
do{
    clock_gettime(CLOCK_MONOTONIC,&tStart);

    for( i=0;i<N;i++){
        for(j=0;j<M;j++){
            C[i][j]= A[i][j] + B[i][j];
        }
    }

    clock_gettime(CLOCK_MONOTONIC,&tEnd);
    tTotal = (tEnd.tv_sec - tStart.tv_sec);
    tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
    if(tTotal<tBest)
        tBest=tTotal;
    } while(w++ < NUM_LOOP);

printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);
return 0;
}

In this case, I've compiled the program with different compiler flags, and the assembly output of the inner loop is as follows:

gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix

movss   xmm1, DWORD PTR A[rcx+rax]
addss   xmm1, DWORD PTR B[rcx+rax]
movss   DWORD PTR C[rcx+rax], xmm1

gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix

vmovss  xmm1, DWORD PTR A[rcx+rax]
vaddss  xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss  DWORD PTR C[rcx+rax], xmm1

AVX version gcc -O2 -mavx:

__m256 vec256;
for(i=0;i<N;i++){
    for(j=0;j<M;j+=8){
        vec256 = _mm256_add_ps( _mm256_load_ps(&A[i][j]) ,  _mm256_load_ps(&B[i][j]));
        _mm256_store_ps(&C[i][j], vec256);
    }
}

SSE version gcc -O2 -msse4.2:

__m128 vec128;
for(i=0;i<N;i++){
    for(j=0;j<M;j+=4){
        vec128 = _mm_add_ps( _mm_load_ps(&A[i][j]) ,  _mm_load_ps(&B[i][j]));
        _mm_store_ps(&C[i][j], vec128);
    }
}

In the scalar program, the speedup of -mavx over -msse4.2 is 2.7x. I know AVX improved the ISA, and the gain might come from those improvements. But when I implemented the program with intrinsics for both AVX and SSE, the speedup is a factor of 3x. The question is: AVX scalar code is 2.7x faster than SSE scalar code, yet when I vectorize it the speedup is only 3x (the matrix size is 128x128 for this question). Does it make any sense that AVX and SSE in scalar mode yield a 2.7x speedup, while the vectorized method should do much better, given that I process eight elements with AVX compared to four elements with SSE? All programs have less than 4.5% cache misses, as perf stat reports.

Using gcc -O2, Linux Mint, Skylake.

UPDATE: Briefly, scalar AVX is 2.7x faster than scalar SSE, but vectorized AVX-256 is only 3x faster than vectorized SSE-128. I think it might be because of pipelining: in scalar code I have 3 vector ALUs that might not be usable in the same way in vectorized mode. I might be comparing apples to oranges instead of apples to apples, and that might be the point I cannot understand.

Solution

The problem you are observing is explained here. On Skylake systems, if the upper half of an AVX register is dirty, there is a false dependency for non-VEX-encoded SSE operations on the upper half of that register. In your case it seems there is a bug in your version of glibc 2.23. On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use

__asm__ __volatile__ ( "vzeroupper" : : : );

to clean the upper halves of the AVX registers. I don't think you can use an intrinsic such as _mm256_zeroupper to fix this, because GCC will say it's SSE code and not recognize the intrinsic. The option -mvzeroupper won't work either, because GCC once again thinks it's SSE code and will not emit the vzeroupper instruction.
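As an illustration, here is a minimal sketch (my own wrapper, not part of the original answer) of how that statement can be packaged so that code compiled with only -msse4.2 can call it. The inline asm bypasses GCC's ISA checks, so it is only safe on a CPU that actually supports AVX, as Skylake does:

/* Hypothetical helper: emit vzeroupper from SSE-compiled code via inline asm.
   The compiler does not verify AVX support here, so only call this on an
   AVX-capable CPU. */
static inline void clear_upper_ymm(void)
{
    __asm__ __volatile__ ( "vzeroupper" : : : );
}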

BTW, it's Microsoft's fault that the hardware has this problem.


Update:

Other people are apparently encountering this problem on Skylake. It has been observed after printf, memset, and clock_gettime.
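For example, here is a sketch (my own modification of the benchmark above, untested on the asker's machine) that applies the workaround by issuing vzeroupper right after the clock_gettime call that starts the timed region, since that call can leave the upper halves dirty:

do{
    clock_gettime(CLOCK_MONOTONIC,&tStart);
    /* assumption: clock_gettime left the upper YMM halves dirty; clear them
       before the timed non-VEX SSE loop runs (requires an AVX-capable CPU) */
    __asm__ __volatile__ ( "vzeroupper" : : : );

    for( i=0;i<N;i++){
        for(j=0;j<M;j++){
            C[i][j]= A[i][j] + B[i][j];
        }
    }

    clock_gettime(CLOCK_MONOTONIC,&tEnd);
    tTotal = (tEnd.tv_sec - tStart.tv_sec);
    tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
    if(tTotal<tBest)
        tBest=tTotal;
} while(w++ < NUM_LOOP);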

If your goal is to compare 128-bit operations with 256-bit operations, you could consider using -mprefer-avx128 -mavx (which is particularly useful on AMD). But then you would be comparing AVX256 vs AVX128 and not AVX256 vs SSE. AVX128 and SSE both use 128-bit operations, but their implementations are different. If you benchmark, you should mention which one you used.
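For instance, a minimal sketch (my own, not from the original answer) showing that the same 128-bit intrinsics end up with different encodings depending on the compiler flags, which is why the AVX128/SSE distinction matters when benchmarking:

#include <immintrin.h>

/* The loop below compiles to the legacy SSE encoding (addps) with
   gcc -O2 -msse4.2, and to the VEX-encoded AVX128 form (vaddps xmm) with
   gcc -O2 -mavx; only the VEX form avoids the dirty-upper-half penalty. */
void add128(const float *a, const float *b, float *c, int n)
{
    for (int j = 0; j < n; j += 4) {
        __m128 v = _mm_add_ps(_mm_loadu_ps(&a[j]), _mm_loadu_ps(&b[j]));
        _mm_storeu_ps(&c[j], v);
    }
}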
