Visual Studio 2010-2015 does not use ymm* registers for AVX optimization

Question

My laptop CPU supports AVX (Advanced Vector Extensions) but not AVX2. With AVX, the 128-bit xmm* registers are extended to 256-bit ymm* registers for floating-point arithmetic. However, in my tests no version of Visual Studio from 2010 to 2015 uses ymm* registers under /arch:AVX optimization, although they do under /arch:AVX2 optimization.

The following shows the disassembly of a simple for loop. The program was compiled as a release build with /arch:AVX and all optimization options on.

    float a[10000], b[10000], c[10000];
    for (int x = 0; x < 10000; x++)
1000988F  xor         eax,eax  
10009891  mov         dword ptr [ebp-9C8Ch],ecx  
        c[x] = (a[x] + b[x])*b[x];
10009897  vmovups     xmm1,xmmword ptr c[eax]  
100098A0  vaddps      xmm0,xmm1,xmmword ptr c[eax]  
100098A9  vmulps      xmm0,xmm0,xmm1  
100098AD  vmovups     xmmword ptr c[eax],xmm0  
100098B6  vmovups     xmm1,xmmword ptr [ebp+eax-9C78h]  
100098BF  vaddps      xmm0,xmm1,xmmword ptr [ebp+eax-9C78h]  
100098C8  vmulps      xmm0,xmm0,xmm1  
100098CC  vmovups     xmmword ptr [ebp+eax-9C78h],xmm0  
100098D5  add         eax,20h  
100098D8  cmp         eax,9C40h  
100098DD  jl          ComputeTempo+67h (10009897h)  


    const int   winpts = (int)(window_size*sr+0.5);
100098DF  vxorps      xmm1,xmm1,xmm1  
100098E3  vcvtsi2ss   xmm1,xmm1,ecx  

I have also verified that using ymm* registers speeds up my program further without crashing. I did that with the intrinsics from immintrin.h, e.g. _mm256_mul_ps.
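A minimal sketch of what such an intrinsic version of the loop might look like. The helper name mul_add_avx and the TARGET_AVX macro are made up for illustration; the original post does not show its intrinsic code. The macro is only there so that GCC and Clang accept the intrinsics without -mavx; MSVC needs no attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// GCC and Clang need AVX enabled per function (or -mavx on the command line);
// MSVC accepts the intrinsics without any attribute.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical helper: c[i] = (a[i] + b[i]) * b[i], eight floats per iteration.
TARGET_AVX
void mul_add_avx(const float* a, const float* b, float* c, std::size_t n) {
    assert(a && b && c);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // unaligned 256-bit loads into ymm registers
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_mul_ps(_mm256_add_ps(va, vb), vb);
        _mm256_storeu_ps(c + i, vc);         // unaligned 256-bit store
    }
    _mm256_zeroupper();  // clear the upper ymm halves before any legacy SSE code runs
    for (; i < n; ++i) {
        c[i] = (a[i] + b[i]) * b[i];         // scalar tail for the remaining elements
    }
}
```

Calling mul_add_avx(a, b, c, 10000) would replace the loop from the question. It must only run on a CPU and OS that support AVX.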

Can any Microsoft compiler developer explain this? Or is this perhaps one of the reasons why Visual Studio generates slower code than the gcc/g++ compiler?

=============edited==============

The reason turns out to be a difference between running a 32-bit OS on a 32-bit machine and running a 32-bit OS on a 64-bit machine. In the latter case, some operating systems might not know that the ymm* registers exist and so do not preserve their upper halves properly during a context switch. If ymm* registers are used on such a system and a context switch occurs, the upper halves might be silently corrupted when another program is also using ymm* registers. Visual Studio is being conservative here.
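Whether the OS actually saves the ymm state can be checked at run time. The sketch below follows the commonly documented detection sequence (CPUID leaf 1 for the OSXSAVE and AVX bits, then XGETBV for XCR0 bits 1 and 2). It is not taken from the original post, and the function names ymm_usable_from_bits and ymm_usable are hypothetical.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#elif defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#endif

// Pure decision logic: the CPU reports AVX, the OS has enabled XSAVE,
// and XCR0 says the OS saves both xmm (bit 1) and ymm (bit 2) state.
constexpr bool ymm_usable_from_bits(std::uint32_t cpuid1_ecx, std::uint64_t xcr0) {
    return (cpuid1_ecx & (1u << 27)) != 0   // OSXSAVE
        && (cpuid1_ecx & (1u << 28)) != 0   // AVX
        && (xcr0 & 0x6) == 0x6;             // xmm and ymm state enabled
}

// Reads the real bits on x86/x64; returns false on other architectures.
inline bool ymm_usable() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 1);
    const std::uint32_t ecx = static_cast<std::uint32_t>(regs[2]);
    if ((ecx & (1u << 27)) == 0) return false;  // XGETBV would fault without OSXSAVE
    return ymm_usable_from_bits(ecx, _xgetbv(0));
#elif defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    if ((ecx & (1u << 27)) == 0) return false;  // XGETBV would fault without OSXSAVE
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ymm_usable_from_bits(ecx, (static_cast<std::uint64_t>(hi) << 32) | lo);
#else
    return false;
#endif
}
```

A program that uses 256-bit intrinsics by hand could call ymm_usable() once at startup and fall back to an SSE or scalar path when it returns false.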

Recommended answer

I made a text file vec.cpp

//vec.cpp
void foo(float *a, float *b, float *c) {
    for (int i = 0; i < 10000; i++) c[i] = (a[i] + b[i])*b[i];
}

then opened a command prompt with the Visual Studio 2015 x86 x64 environment enabled and ran

cl /c /O2 /arch:AVX /FA vec.cpp

Looking at the file vec.asm I see

$LL4@foo:
    vmovups ymm0, YMMWORD PTR [rax-32]
    lea rax, QWORD PTR [rax+64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-96]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-96], ymm2
    vmovups ymm0, YMMWORD PTR [rax-64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-64]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-64], ymm2
    sub rdx, 1
    jne SHORT $LL4@foo
    vzeroupper


The problem is that you are compiling in 32-bit mode. Using the same function as above but compiling in 32-bit mode, I get

$LL4@foo:
    lea eax, DWORD PTR [ebx+esi]
    lea ecx, DWORD PTR [ecx+32]
    lea esi, DWORD PTR [esi+32]
    vmovups xmm1, XMMWORD PTR [esi-48]
    vaddps  xmm0, xmm1, XMMWORD PTR [ecx-32]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [edx+ecx-32], xmm0
    vmovups xmm1, XMMWORD PTR [esi-32]
    vaddps  xmm0, xmm1, XMMWORD PTR [eax]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [eax+edx], xmm0
    sub edi, 1
    jne SHORT $LL4@foo
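
To confirm which of the two situations a given build is in, the predefined macros can be inspected. This is only a sketch: the names kTargetArch, kVectorIsa and print_build_target are invented for illustration. The macros themselves (_M_X64, _M_IX86, __AVX__, __AVX2__ on MSVC, plus the GCC/Clang equivalents) are set by the compiler according to the target and the /arch or -m options.

```cpp
#include <cstdio>

// Target architecture, as seen by the preprocessor.
#if defined(_M_X64) || defined(__x86_64__)
constexpr const char* kTargetArch = "x64";
#elif defined(_M_IX86) || defined(__i386__)
constexpr const char* kTargetArch = "x86 (32-bit)";
#else
constexpr const char* kTargetArch = "other";
#endif

// Highest AVX level the compiler was told it may use for generated code.
#if defined(__AVX2__)
constexpr const char* kVectorIsa = "AVX2";
#elif defined(__AVX__)
constexpr const char* kVectorIsa = "AVX";
#else
constexpr const char* kVectorIsa = "no AVX";
#endif

// Hypothetical helper: prints the build configuration of this translation unit.
inline void print_build_target() {
    std::printf("target: %s, vector ISA: %s\n", kTargetArch, kVectorIsa);
}
```

A build that reports x86 (32-bit) together with AVX corresponds to the xmm-only listing above; x64 together with AVX corresponds to the ymm listing.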

