Same AVX instruction set code, huge performance gap between Intel Core and AMD Ryzen

Question

I want to use the AVX instruction set to accelerate a convolution from an 8-channel image to an 8-channel image, using a 3x3 convolution kernel. My code is as follows:

        const float* kptr = kernels;
        const float* bptr = biases;

        // Three independent accumulators; only _out0 starts from the biases.
        __m256 _out0 = _mm256_loadu_ps(bptr);
        __m256 _out1 = _mm256_setzero_ps();
        __m256 _out2 = _mm256_setzero_ps();

        for (int i = 0; i < 8; i ++)
        {
            const __m256 _r00 = _mm256_broadcast_ss(tl + i);
            const __m256 _r01 = _mm256_broadcast_ss(tc + i);
            const __m256 _r02 = _mm256_broadcast_ss(tr + i);
            const __m256 _r03 = _mm256_broadcast_ss(ml + i);
            const __m256 _r04 = _mm256_broadcast_ss(mc + i);
            const __m256 _r05 = _mm256_broadcast_ss(mr + i);
            const __m256 _r06 = _mm256_broadcast_ss(bl + i);
            const __m256 _r07 = _mm256_broadcast_ss(bc + i);
            const __m256 _r08 = _mm256_broadcast_ss(br + i);

            const __m256 _k00 = _mm256_loadu_ps(kptr + i * 72);
            const __m256 _k01 = _mm256_loadu_ps(kptr + i * 72 + 8);
            const __m256 _k02 = _mm256_loadu_ps(kptr + i * 72 + 16);
            const __m256 _k03 = _mm256_loadu_ps(kptr + i * 72 + 24);
            const __m256 _k04 = _mm256_loadu_ps(kptr + i * 72 + 32);
            const __m256 _k05 = _mm256_loadu_ps(kptr + i * 72 + 40);
            const __m256 _k06 = _mm256_loadu_ps(kptr + i * 72 + 48);
            const __m256 _k07 = _mm256_loadu_ps(kptr + i * 72 + 56);
            const __m256 _k08 = _mm256_loadu_ps(kptr + i * 72 + 64);

            _out0 = _mm256_fmadd_ps(_r00, _k00, _out0);
            _out1 = _mm256_fmadd_ps(_r01, _k01, _out1);
            _out2 = _mm256_fmadd_ps(_r02, _k02, _out2);
            _out0 = _mm256_fmadd_ps(_r03, _k03, _out0);
            _out1 = _mm256_fmadd_ps(_r04, _k04, _out1);
            _out2 = _mm256_fmadd_ps(_r05, _k05, _out2);
            _out0 = _mm256_fmadd_ps(_r06, _k06, _out0);
            _out1 = _mm256_fmadd_ps(_r07, _k07, _out1);
            _out2 = _mm256_fmadd_ps(_r08, _k08, _out2);
        }
        // Sum the three accumulators, then clamp negative values to zero (ReLU).
        _out0 = _mm256_max_ps(_mm256_add_ps(_out0, _mm256_add_ps(_out1, _out2)), _mm256_setzero_ps());

        _mm256_storeu_ps(outMat, _out0);

On Ryzen, this is very effective. Tested on an R5 2600 and an R5 3500U, I get a 2-4x performance improvement over ordinary C++ code with compiler optimization. But on Intel Core CPUs (tested on an i7 8750H and an i3 4170), it is even 50% slower than ordinary C++ code with compiler optimization. In fact, the 3500U is 4 times faster than the i7 8750H in this case.

I am confused by this. I found that the most time-consuming instruction on the Intel CPUs is the fmadd instruction, but replacing fmadd with the equivalent AVX instructions brought no improvement.

I also considered the limit on the number of registers, but when I tried reducing the number of __m256 variables, the situation, if anything, got worse.

The compiler and parameters are the same: I compiled with MSVC 2019, and I even ran the same binary.

The memory layout of the weights (kptr) is CHWB; the input image pixels (tl to br) are BHWC.
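To make the two layouts concrete, here is a minimal scalar sketch of what the AVX code above computes for one output pixel. The function name conv8to8Reference and the taps array (pointers to the nine neighbours in the order tl, tc, tr, ml, mc, mr, bl, bc, br) are hypothetical names introduced for illustration; the weight index i * 72 + p * 8 + o follows the kptr offsets in the code above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for one output pixel of the 8 -> 8 channel
// 3x3 convolution. taps[p] points at the 8 input channels of neighbour p.
// kernels holds 8 * 9 * 8 floats indexed as [inChannel][tap][outChannel],
// which matches the offsets kptr + i * 72 + p * 8 used by the AVX code.
std::array<float, 8> conv8to8Reference(const std::array<const float*, 9>& taps,
                                       const float* kernels,
                                       const float* biases)
{
    assert(kernels != nullptr && biases != nullptr);

    std::array<float, 8> out{};
    for (std::size_t o = 0; o < 8; ++o)          // output channel
    {
        float acc = biases[o];
        for (std::size_t i = 0; i < 8; ++i)      // input channel
            for (std::size_t p = 0; p < 9; ++p)  // position in the 3x3 window
                acc += taps[p][i] * kernels[i * 72 + p * 8 + o];
        out[o] = std::max(acc, 0.0f);            // ReLU, like _mm256_max_ps(..., 0)
    }
    return out;
}
```

Because the summation order differs from the three-accumulator AVX version, the results may differ in the last bits for general inputs; the sketch is meant as a readable cross-check, not as a fast path.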

During the test, I noticed that in the same scenario the i7 8750H is at full load, while the 2600 is at about 35% load and delivers 8 times the performance of the former.

Any suggestions?

I did not find a good way to disassemble the binary compiled by MSVC, so I compiled under Linux and disassembled with GDB. Here is the GDB disassembly, built with these flags:

-g -fopenmp -lpthread -mavx2 -mfma -O3

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
898       return *(__m256_u *)__P;
   0x00007fffff710967 <+135>:   vxorps %xmm1,%xmm1,%xmm1
   0x00007fffff71096b <+139>:   lea    0x4(,%r13,4),%r13
   0x00007fffff710973 <+147>:   lea    0x4(,%rdi,4),%rdi
   0x00007fffff71097b <+155>:   vmovaps %ymm1,%ymm3
   0x00007fffff71097f <+159>:   mov    (%rax),%r10
   0x00007fffff710982 <+162>:   mov    0x10(%r9),%rax
   0x00007fffff710986 <+166>:   lea    0x4(,%rsi,4),%rsi
   0x00007fffff71098e <+174>:   lea    (%r11,%rdi,1),%rbx
   0x00007fffff710992 <+178>:   lea    (%r11,%rsi,1),%r12
   0x00007fffff710996 <+182>:   lea    (%rdx,%rsi,1),%r9
   0x00007fffff71099a <+186>:   add    %r13,%r11
   0x00007fffff71099d <+189>:   add    %rcx,%rsi
   0x00007fffff7109a0 <+192>:   mov    (%rax),%rax
   0x00007fffff7109a3 <+195>:   vmovups (%r10),%xmm7
   0x00007fffff7109a8 <+200>:   vinsertf128 $0x1,0x10(%r10),%ymm7,%ymm0

/home/tianzer/Anime4KCPP/Anime4KCore/src/CPUCNNProcessor.cpp:
390             for (int i = 0; i < 8; i += 2)
=> 0x00007fffff7109af <+207>:   lea    (%rdx,%rdi,1),%r10
   0x00007fffff7109b3 <+211>:   add    %r13,%rdx
   0x00007fffff7109b6 <+214>:   add    %rcx,%rdi
   0x00007fffff7109b9 <+217>:   add    %r13,%rcx
   0x00007fffff7109bc <+220>:   lea    0x900(%rax),%r13

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
735       return (__m256) __builtin_ia32_vbroadcastss256 (__X);
   0x00007fffff7109c3 <+227>:   vbroadcastss -0x4(%rbx),%ymm11
   0x00007fffff7109c9 <+233>:   vmovups (%rax),%xmm5
   0x00007fffff7109cd <+237>:   add    $0x8,%rbx
   0x00007fffff7109d1 <+241>:   add    $0x240,%rax
   0x00007fffff7109d7 <+247>:   vbroadcastss -0x4(%r11),%ymm6
   0x00007fffff7109dd <+253>:   vbroadcastss -0x4(%r9),%ymm8
   0x00007fffff7109e3 <+259>:   add    $0x8,%r12
   0x00007fffff7109e7 <+263>:   add    $0x8,%r11
   0x00007fffff7109eb <+267>:   vbroadcastss -0x4(%rdx),%ymm7
   0x00007fffff7109f1 <+273>:   vbroadcastss -0x4(%rsi),%ymm4
   0x00007fffff7109f7 <+279>:   add    $0x8,%r10
   0x00007fffff7109fb <+283>:   add    $0x8,%r9
   0x00007fffff7109ff <+287>:   vbroadcastss -0xc(%r12),%ymm10
   0x00007fffff710a06 <+294>:   vbroadcastss -0xc(%r10),%ymm9
   0x00007fffff710a0c <+300>:   add    $0x8,%rdx
   0x00007fffff710a10 <+304>:   add    $0x8,%rdi
   0x00007fffff710a14 <+308>:   vbroadcastss -0x4(%rcx),%ymm2
   0x00007fffff710a1a <+314>:   vbroadcastss -0xc(%rdi),%ymm12

/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:
65        return (__m256)__builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B,
   0x00007fffff710a20 <+320>:   add    $0x8,%rsi
   0x00007fffff710a24 <+324>:   add    $0x8,%rcx
   0x00007fffff710a28 <+328>:   vinsertf128 $0x1,-0x230(%rax),%ymm5,%ymm5
   0x00007fffff710a32 <+338>:   vfmadd231ps %ymm5,%ymm11,%ymm0
   0x00007fffff710a37 <+343>:   vmovups -0x220(%rax),%xmm5
   0x00007fffff710a3f <+351>:   vinsertf128 $0x1,-0x210(%rax),%ymm5,%ymm5
   0x00007fffff710a49 <+361>:   vfmadd231ps %ymm5,%ymm10,%ymm3
   0x00007fffff710a4e <+366>:   vmovups -0x200(%rax),%xmm5
   0x00007fffff710a56 <+374>:   vinsertf128 $0x1,-0x1f0(%rax),%ymm5,%ymm5
   0x00007fffff710a60 <+384>:   vfmadd231ps %ymm5,%ymm6,%ymm1
   0x00007fffff710a65 <+389>:   vmovups -0x1e0(%rax),%xmm6
   0x00007fffff710a6d <+397>:   vinsertf128 $0x1,-0x1d0(%rax),%ymm6,%ymm11
   0x00007fffff710a77 <+407>:   vmovups -0x1c0(%rax),%xmm6
   0x00007fffff710a7f <+415>:   vinsertf128 $0x1,-0x1b0(%rax),%ymm6,%ymm10
   0x00007fffff710a89 <+425>:   vfmadd132ps %ymm11,%ymm0,%ymm9
   0x00007fffff710a8e <+430>:   vfmadd132ps %ymm10,%ymm3,%ymm8
   0x00007fffff710a93 <+435>:   vmovups -0x1a0(%rax),%xmm3
   0x00007fffff710a9b <+443>:   vinsertf128 $0x1,-0x190(%rax),%ymm3,%ymm6
   0x00007fffff710aa5 <+453>:   vfmadd132ps %ymm6,%ymm1,%ymm7
   0x00007fffff710aaa <+458>:   vmovups -0x180(%rax),%xmm1
   0x00007fffff710ab2 <+466>:   vinsertf128 $0x1,-0x170(%rax),%ymm1,%ymm5
   0x00007fffff710abc <+476>:   vmovups -0x160(%rax),%xmm1
   0x00007fffff710ac4 <+484>:   vinsertf128 $0x1,-0x150(%rax),%ymm1,%ymm3
   0x00007fffff710ace <+494>:   vfmadd132ps %ymm5,%ymm9,%ymm12
   0x00007fffff710ad3 <+499>:   vbroadcastss -0x8(%r11),%ymm1
   0x00007fffff710ad9 <+505>:   vbroadcastss -0x8(%r10),%ymm5
   0x00007fffff710adf <+511>:   vfmadd132ps %ymm3,%ymm8,%ymm4
   0x00007fffff710ae4 <+516>:   vbroadcastss -0x8(%r12),%ymm3
   0x00007fffff710aeb <+523>:   vmovaps %ymm12,%ymm11
   0x00007fffff710af0 <+528>:   vmovaps %ymm4,%ymm10
   0x00007fffff710af4 <+532>:   vmovups -0x140(%rax),%xmm4
   0x00007fffff710afc <+540>:   vinsertf128 $0x1,-0x130(%rax),%ymm4,%ymm0
   0x00007fffff710b06 <+550>:   vbroadcastss -0x8(%r9),%ymm4
   0x00007fffff710b0c <+556>:   vfmadd132ps %ymm0,%ymm7,%ymm2
   0x00007fffff710b11 <+561>:   vbroadcastss -0x8(%rbx),%ymm0
   0x00007fffff710b17 <+567>:   vmovaps %ymm2,%ymm6

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
735       return (__m256) __builtin_ia32_vbroadcastss256 (__X);
   0x00007fffff710b1b <+571>:   vbroadcastss -0x8(%rdx),%ymm2
   0x00007fffff710b21 <+577>:   vbroadcastss -0x8(%rsi),%ymm8
   0x00007fffff710b27 <+583>:   vmovups -0x120(%rax),%xmm13
   0x00007fffff710b2f <+591>:   vmovups -0x100(%rax),%xmm14
   0x00007fffff710b37 <+599>:   vinsertf128 $0x1,-0x110(%rax),%ymm13,%ymm12
   0x00007fffff710b41 <+609>:   vmovups -0xe0(%rax),%xmm15
   0x00007fffff710b49 <+617>:   vbroadcastss -0x8(%rdi),%ymm9
   0x00007fffff710b4f <+623>:   vbroadcastss -0x8(%rcx),%ymm7

/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:
65        return (__m256)__builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B,
   0x00007fffff710b55 <+629>:   vfmadd132ps %ymm12,%ymm11,%ymm0
   0x00007fffff710b5a <+634>:   vinsertf128 $0x1,-0xf0(%rax),%ymm14,%ymm11
   0x00007fffff710b64 <+644>:   vfmadd132ps %ymm11,%ymm10,%ymm3
   0x00007fffff710b69 <+649>:   vinsertf128 $0x1,-0xd0(%rax),%ymm15,%ymm10
   0x00007fffff710b73 <+659>:   vfmadd132ps %ymm10,%ymm6,%ymm1
   0x00007fffff710b78 <+664>:   vmovups -0xc0(%rax),%xmm6
   0x00007fffff710b80 <+672>:   vinsertf128 $0x1,-0xb0(%rax),%ymm6,%ymm6
   0x00007fffff710b8a <+682>:   vfmadd132ps %ymm6,%ymm0,%ymm5
   0x00007fffff710b8f <+687>:   vmovups -0xa0(%rax),%xmm6
   0x00007fffff710b97 <+695>:   vinsertf128 $0x1,-0x90(%rax),%ymm6,%ymm0
   0x00007fffff710ba1 <+705>:   vfmadd132ps %ymm0,%ymm3,%ymm4
   0x00007fffff710ba6 <+710>:   vmovups -0x80(%rax),%xmm3
   0x00007fffff710bab <+715>:   vmovaps %ymm9,%ymm0
   0x00007fffff710baf <+719>:   vinsertf128 $0x1,-0x70(%rax),%ymm3,%ymm6
   0x00007fffff710bb6 <+726>:   vmovups -0x40(%rax),%xmm3
   0x00007fffff710bbb <+731>:   vinsertf128 $0x1,-0x30(%rax),%ymm3,%ymm3
   0x00007fffff710bc2 <+738>:   vfmadd132ps %ymm6,%ymm1,%ymm2
   0x00007fffff710bc7 <+743>:   vmovups -0x60(%rax),%xmm1
   0x00007fffff710bcc <+748>:   vinsertf128 $0x1,-0x50(%rax),%ymm1,%ymm6
   0x00007fffff710bd3 <+755>:   vfmadd132ps %ymm6,%ymm5,%ymm0
   0x00007fffff710bd8 <+760>:   vfmadd132ps %ymm8,%ymm4,%ymm3
   0x00007fffff710bdd <+765>:   vmovups -0x20(%rax),%xmm4
   0x00007fffff710be2 <+770>:   vinsertf128 $0x1,-0x10(%rax),%ymm4,%ymm1
   0x00007fffff710be9 <+777>:   vfmadd132ps %ymm7,%ymm2,%ymm1

/home/tianzer/Anime4KCPP/Anime4KCore/src/CPUCNNProcessor.cpp:
390             for (int i = 0; i < 8; i += 2)
   0x00007fffff710bee <+782>:   cmp    %rax,%r13
   0x00007fffff710bf1 <+785>:   jne    0x7fffff7109c3 <std::_Function_handler<void(int, int, float*, float*), Anime4KCPP::CPU::CNNProcessor::conv8To8(const FP*, const FP*, cv::Mat&)::<lambda(int, int, Anime4KCPP::CPU::ChanFP, Anime4KCPP::CPU::LineFP)> >::_M_invoke(const std::_Any_data &, int &&, int &&, float *&&, float *&&)+227>

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
1230      return __extension__ (__m256){ 0.0, 0.0, 0.0, 0.0,
   0x00007fffff710bf7 <+791>:   vaddps %ymm3,%ymm0,%ymm0
   0x00007fffff710bfb <+795>:   vaddps %ymm1,%ymm0,%ymm0
   0x00007fffff710bff <+799>:   vxorps %xmm1,%xmm1,%xmm1
   0x00007fffff710c03 <+803>:   vmaxps %ymm1,%ymm0,%ymm0

904       *(__m256_u *)__P = __A;
   0x00007fffff710c07 <+807>:   vmovups %xmm0,(%r8)
   0x00007fffff710c0c <+812>:   vextractf128 $0x1,%ymm0,0x10(%r8)
   0x00007fffff710c13 <+819>:   vzeroupper
   0x00007fffff710c16 <+822>:   pop    %rbx
   0x00007fffff710c17 <+823>:   pop    %r12
   0x00007fffff710c19 <+825>:   pop    %r13
   0x00007fffff710c1b <+827>:   pop    %rbp
   0x00007fffff710c1c <+828>:   retq

If I build with -march=native instead (-g -fopenmp -lpthread -march=native -O3):

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
898       return *(__m256_u *)__P;
   0x00007fffff711596 <+134>:   vxorps %xmm1,%xmm1,%xmm1
   0x00007fffff71159a <+138>:   lea    0x4(,%r10,4),%r13
   0x00007fffff7115a2 <+146>:   lea    0x4(,%rdi,4),%rdi
   0x00007fffff7115aa <+154>:   vmovaps %ymm1,%ymm2
   0x00007fffff7115ae <+158>:   mov    (%rax),%rax
   0x00007fffff7115b1 <+161>:   lea    0x4(,%rsi,4),%rsi
   0x00007fffff7115b9 <+169>:   lea    (%r11,%rdi,1),%rbx
   0x00007fffff7115bd <+173>:   lea    (%r11,%rsi,1),%r12
   0x00007fffff7115c1 <+177>:   lea    (%rdx,%rdi,1),%r10
   0x00007fffff7115c5 <+181>:   add    %r13,%r11
   0x00007fffff7115c8 <+184>:   add    %rcx,%rdi
   0x00007fffff7115cb <+187>:   vmovups (%rax),%ymm0

/home/tianzer/Anime4KCPP/Anime4KCore/src/CPUCNNProcessor.cpp:
390             for (int i = 0; i < 8; i += 2)
=> 0x00007fffff7115cf <+191>:   mov    0x10(%r9),%rax
   0x00007fffff7115d3 <+195>:   lea    (%rdx,%rsi,1),%r9
   0x00007fffff7115d7 <+199>:   add    %r13,%rdx
   0x00007fffff7115da <+202>:   add    %rcx,%rsi
   0x00007fffff7115dd <+205>:   add    %r13,%rcx
   0x00007fffff7115e0 <+208>:   mov    (%rax),%rax
   0x00007fffff7115e3 <+211>:   lea    0x900(%rax),%r13

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
735       return (__m256) __builtin_ia32_vbroadcastss256 (__X);
   0x00007fffff7115ea <+218>:   vbroadcastss -0x4(%r11),%ymm4
   0x00007fffff7115f0 <+224>:   vbroadcastss -0x4(%rbx),%ymm3
   0x00007fffff7115f6 <+230>:   add    $0x8,%r12
   0x00007fffff7115fa <+234>:   add    $0x240,%rax
   0x00007fffff711600 <+240>:   vbroadcastss -0x4(%r10),%ymm11
   0x00007fffff711606 <+246>:   vbroadcastss -0x4(%r9),%ymm10
   0x00007fffff71160c <+252>:   add    $0x8,%rbx
   0x00007fffff711610 <+256>:   add    $0x8,%r11
   0x00007fffff711614 <+260>:   vbroadcastss -0x4(%rdx),%ymm9
   0x00007fffff71161a <+266>:   vbroadcastss -0x4(%rdi),%ymm8
   0x00007fffff711620 <+272>:   add    $0x8,%r10
   0x00007fffff711624 <+276>:   add    $0x8,%r9
   0x00007fffff711628 <+280>:   vbroadcastss -0xc(%r12),%ymm5
   0x00007fffff71162f <+287>:   vbroadcastss -0x4(%rsi),%ymm7
   0x00007fffff711635 <+293>:   add    $0x8,%rdx
   0x00007fffff711639 <+297>:   add    $0x8,%rdi
   0x00007fffff71163d <+301>:   vbroadcastss -0x4(%rcx),%ymm6

/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:
65        return (__m256)__builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B,
   0x00007fffff711643 <+307>:   add    $0x8,%rsi
   0x00007fffff711647 <+311>:   add    $0x8,%rcx
   0x00007fffff71164b <+315>:   vfmadd132ps -0x240(%rax),%ymm0,%ymm3
   0x00007fffff711654 <+324>:   vbroadcastss -0x8(%r10),%ymm0
   0x00007fffff71165a <+330>:   vfmadd132ps -0x220(%rax),%ymm2,%ymm5
   0x00007fffff711663 <+339>:   vbroadcastss -0x8(%rsi),%ymm2
   0x00007fffff711669 <+345>:   vfmadd231ps -0x200(%rax),%ymm4,%ymm1
   0x00007fffff711672 <+354>:   vbroadcastss -0x8(%rdx),%ymm4
   0x00007fffff711678 <+360>:   vfmadd132ps -0x1e0(%rax),%ymm3,%ymm11
   0x00007fffff711681 <+369>:   vbroadcastss -0x8(%rcx),%ymm3
   0x00007fffff711687 <+375>:   vfmadd132ps -0x1c0(%rax),%ymm5,%ymm10
   0x00007fffff711690 <+384>:   vbroadcastss -0x8(%r9),%ymm5
   0x00007fffff711696 <+390>:   vfmadd132ps -0x1a0(%rax),%ymm1,%ymm9
   0x00007fffff71169f <+399>:   vbroadcastss -0x8(%rdi),%ymm1
   0x00007fffff7116a5 <+405>:   vfmadd231ps -0x180(%rax),%ymm8,%ymm11
   0x00007fffff7116ae <+414>:   vbroadcastss -0x8(%rbx),%ymm8
   0x00007fffff7116b4 <+420>:   vfmadd231ps -0x160(%rax),%ymm7,%ymm10
   0x00007fffff7116bd <+429>:   vbroadcastss -0x8(%r12),%ymm7
   0x00007fffff7116c4 <+436>:   vfmadd231ps -0x140(%rax),%ymm6,%ymm9

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
735       return (__m256) __builtin_ia32_vbroadcastss256 (__X);
   0x00007fffff7116cd <+445>:   vbroadcastss -0x8(%r11),%ymm6

/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:
65        return (__m256)__builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B,
   0x00007fffff7116d3 <+451>:   vfmadd132ps -0x120(%rax),%ymm11,%ymm8
   0x00007fffff7116dc <+460>:   vfmadd132ps -0x100(%rax),%ymm10,%ymm7
   0x00007fffff7116e5 <+469>:   vfmadd132ps -0xe0(%rax),%ymm9,%ymm6
   0x00007fffff7116ee <+478>:   vfmadd132ps -0xc0(%rax),%ymm8,%ymm0
   0x00007fffff7116f7 <+487>:   vfmadd132ps -0xa0(%rax),%ymm7,%ymm5
   0x00007fffff711700 <+496>:   vfmadd132ps -0x80(%rax),%ymm6,%ymm4
   0x00007fffff711706 <+502>:   vfmadd132ps -0x20(%rax),%ymm4,%ymm3
   0x00007fffff71170c <+508>:   vfmadd231ps -0x60(%rax),%ymm1,%ymm0
   0x00007fffff711712 <+514>:   vfmadd132ps -0x40(%rax),%ymm5,%ymm2
   0x00007fffff711718 <+520>:   vmovaps %ymm3,%ymm1

/home/tianzer/Anime4KCPP/Anime4KCore/src/CPUCNNProcessor.cpp:
390             for (int i = 0; i < 8; i += 2)
   0x00007fffff71171c <+524>:   cmp    %rax,%r13
   0x00007fffff71171f <+527>:   jne    0x7fffff7115ea <std::_Function_handler<void(int, int, float*, float*), Anime4KCPP::CPU::CNNProcessor::conv8To8(const FP*, const FP*, cv::Mat&)::<lambda(int, int, Anime4KCPP::CPU::ChanFP, Anime4KCPP::CPU::LineFP)> >::_M_invoke(const std::_Any_data &, int &&, int &&, float *&&, float *&&)+218>

/usr/lib/gcc/x86_64-linux-gnu/9/include/avxintrin.h:
1230      return __extension__ (__m256){ 0.0, 0.0, 0.0, 0.0,
   0x00007fffff711725 <+533>:   vaddps %ymm2,%ymm0,%ymm0
   0x00007fffff711729 <+537>:   vxorps %xmm1,%xmm1,%xmm1
   0x00007fffff71172d <+541>:   vaddps %ymm3,%ymm0,%ymm0
   0x00007fffff711731 <+545>:   vmaxps %ymm1,%ymm0,%ymm0

904       *(__m256_u *)__P = __A;
   0x00007fffff711735 <+549>:   vmovups %ymm0,(%r8)
   0x00007fffff71173a <+554>:   vzeroupper
   0x00007fffff71173d <+557>:   pop    %rbx
   0x00007fffff71173e <+558>:   pop    %r12
   0x00007fffff711740 <+560>:   pop    %r13
   0x00007fffff711742 <+562>:   pop    %rbp
   0x00007fffff711743 <+563>:   retq

Benchmark results from the Intel i3 4170. The score is the reciprocal of the processing time multiplied by a factor, so higher is better. These numbers use the GCC-built binaries, consistent with the disassembly above; the MSVC results are almost the same:

ordinary C++ code: 4.13368
-mavx2 -mfma: 2.51132
-march=native: 2.46779
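Since the score is proportional to the reciprocal of the processing time, the ratio of two scores gives the ratio of the two times with the unknown factor cancelling out. A small sketch of that arithmetic, where relativeTime is a hypothetical helper name:

```cpp
// score = factor / time, so time_candidate / time_baseline
//       = score_baseline / score_candidate (the factor cancels).
constexpr double relativeTime(double baselineScore, double candidateScore)
{
    return baselineScore / candidateScore;
}

// Scores taken from the benchmark above (Intel i3 4170).
constexpr double kPlainCppScore = 4.13368;
constexpr double kAvx2FmaScore = 2.51132;
constexpr double kMarchNativeScore = 2.46779;

// How many times longer each AVX build takes than the plain C++ build.
constexpr double kAvx2FmaSlowdown = relativeTime(kPlainCppScore, kAvx2FmaScore);
constexpr double kMarchNativeSlowdown = relativeTime(kPlainCppScore, kMarchNativeScore);
```

On these numbers, both AVX builds take roughly 1.6 to 1.7 times as long as the plain C++ build on the i3 4170.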

I noticed that in the -march=native build, the vfmadd instructions (for example vfmadd231ps) take an operand directly from memory. Is it because Intel's L2 is not big enough? Ryzen's L2 per core is twice that of Intel.
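As a quick size check on the cache question, the weights touched by the loop can be counted directly. The sketch below assumes a 32 KiB L1 data cache per core, which is an assumption about the CPUs involved and not something measured here; the constant names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kInChannels = 8;
constexpr std::size_t kTaps = 9;          // 3x3 window
constexpr std::size_t kOutChannels = 8;

// Bytes of weights read for one output pixel.
constexpr std::size_t kWeightBytes =
    kInChannels * kTaps * kOutChannels * sizeof(float);

// Assumption: 32 KiB of L1 data cache per core.
constexpr std::size_t kAssumedL1DataBytes = 32 * 1024;

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
// 0x900 is the same span the disassembly computes with lea 0x900(%rax),%r13.
static_assert(kWeightBytes == 0x900, "8 * 9 * 8 floats should be 2304 bytes");
static_assert(kWeightBytes < kAssumedL1DataBytes, "weights fit in the assumed L1D");
```

Under that assumption the 2304 bytes of weights are far smaller than L1, so the weights alone would not be expected to depend on L2 capacity. This says nothing about the image data, which is not sized here.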

Recommended answer

If I read the code correctly, the first calculation of _out0 requires just _r00 and _k00, then _out1 requires _r01 and _k01, etcetera. Then, for _out0 you need _r03 and _k03, etcetera.

That's recognizable code. A GRU neural network, I suppose?

Anyway, the trick is to merge the 9 sub-matrices in memory so that you have only one weights matrix, and then produce only one output vector. If you really need the output split into 3 vectors, you could copy the values in a later step, but that probably isn't necessary. And even if the copy is necessary, it's fairly cheap if you can merge it with the activation function.
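One possible reading of this suggestion, sketched in scalar C++: gather the 9 neighbours x 8 channels into a single 72-element input vector, treat the weights as one 72x8 matrix, and compute a single matrix-vector product with one accumulator per output channel. The names gatherInput and conv8to8Merged are hypothetical, and the row order i * 9 + p is chosen here so that the existing weight buffer from the question (offsets i * 72 + p * 8) can be reused unchanged.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: flatten the 3x3 window into one vector.
// taps[p] points at the 8 channels of neighbour p; row index is i * 9 + p.
std::array<float, 72> gatherInput(const std::array<const float*, 9>& taps)
{
    std::array<float, 72> x{};
    for (std::size_t i = 0; i < 8; ++i)
        for (std::size_t p = 0; p < 9; ++p)
            x[i * 9 + p] = taps[p][i];
    return x;
}

// One 72x8 matrix-vector product, bias add and ReLU, with a single
// accumulator per output channel. weights[r * 8 + o] is row r, column o.
std::array<float, 8> conv8to8Merged(const std::array<float, 72>& x,
                                    const float* weights,
                                    const float* biases)
{
    assert(weights != nullptr && biases != nullptr);

    std::array<float, 8> out{};
    for (std::size_t o = 0; o < 8; ++o)
    {
        float acc = biases[o];
        for (std::size_t r = 0; r < 72; ++r)
            acc += x[r] * weights[r * 8 + o];
        out[o] = std::max(acc, 0.0f);  // activation merged into the same pass
    }
    return out;
}
```

This only illustrates the data arrangement. Whether it closes the gap on Intel CPUs is not shown by the sketch and would need to be measured with a vectorized version.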
