This article looks at how to get GCC to produce a loop without a cmp instruction; the question and the recommended answer below may be a useful reference if you are facing the same problem.

Problem description

I have a number of tight loops I'm trying to optimize with GCC and intrinsics. For example, consider the following function.

#include <immintrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}

This produces a main loop like this:

20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add    rax,0x20
33: cmp    rax,rcx
36: jne    20

But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the end of the arrays, set rax to -sizeof(float)*n, and then test for zero. I am able to do this with my own assembly code like this:

.L2  vmulps          ymm1, ymm2, [rdi+rax]
     vaddps          ymm0, ymm1, [rsi+rax]
     vmovaps         [rdx+rax], ymm0
     add             rax, 32
     jne             .L2

but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch, or a way to reorder/change my code, so that the cmp instruction is not produced by GCC.

I tried the following, but it still produces cmp. Every variation I have tried still produces cmp.

void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}
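Note that both versions use the aligned load/store intrinsics, so x, y and z must be 32-byte aligned and n must be a multiple of 8. A minimal driver for testing (my own sketch; the question uses n=2048 but does not show its harness) might look like this:

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    const int n = 2048;                              /* multiple of 8, small enough to stay in L1 */
    float *x = _mm_malloc(n * sizeof(float), 32);    /* 32-byte alignment for _mm256_load_ps/_mm256_store_ps */
    float *y = _mm_malloc(n * sizeof(float), 32);
    float *z = _mm_malloc(n * sizeof(float), 32);
    for(int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    triad(x, y, z, n);                               /* the triad defined above, assumed to be in the same file */
    printf("z[0] = %f\n", z[0]);                     /* expect 1 + 3.14159*2 = 7.28318 */
    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}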

I'm interested in maximizing instruction-level parallelism (ILP) for these functions for arrays which fit in the L1 cache (in practice, n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
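To make the unrolling trade-off concrete, here is a minimal 2x-unrolled variant of the intrinsic loop (using the same #include <immintrin.h> as above). This is only my illustration of the idea; the unroll16 numbers below come from hand-written assembly unrolled 16 times, and this sketch assumes n is a multiple of 16:

void triad_unroll2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    __m256 k4 = _mm256_set1_ps(k);
    int i;
    for(i=0; i<n; i+=16) {  /* two 8-float AVX vectors per iteration */
        _mm256_store_ps(&z[i],   _mm256_add_ps(_mm256_load_ps(&x[i]),   _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        _mm256_store_ps(&z[i+8], _mm256_add_ps(_mm256_load_ps(&x[i+8]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+8]))));
    }
}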

Here is a table of results for a Core2 (pre-Nehalem), an IvyBridge, and a Haswell system. intrinsic is the result of using intrinsics, unroll1 is my assembly code without cmp, and unroll16 is my assembly code unrolled 16 times. The percentages are percentages of peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA).

                 SSE         AVX         FMA
intrinsic      71.3%       90.9%       53.6%
unroll1        97.0%       96.1%       63.5%
unroll16       98.6%       90.4%       93.6%
ScottD         96.5%
32B code align             95.5%
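To put the percentages in concrete units: with the AVX figure of 48 bytes/cycle, a hypothetical 3.5 GHz core (the question does not state the clock speeds used) has a peak of 3.5e9 cycles/s x 48 bytes/cycle, roughly 168 GB/s, so the 90.9% intrinsic result would correspond to about 153 GB/s of L1 traffic.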

For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use cmp. On AVX I get the best result without unrolling and without using cmp. It's interesting that on IvyBridge unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked this question. The source code used to test this can be found in that question.

Based on ScottD's answer, I now get almost 97% with intrinsics for my Core2 system (pre-Nehalem, 64-bit mode). I'm not sure why the cmp actually matters, since each iteration should take 2 clock cycles anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.
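As an aside (my suggestion, not something the question reports trying as a compiler switch): GCC can pad loop entry points to a wider boundary with -falign-loops, so an experiment along the lines of

gcc -c -O3 -march=corei7 -mavx -falign-loops=32 triad.c

might recover some of the alignment-related loss on Sandy Bridge; whether it actually does is not something tested here.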

Recommended answer

How about this? The compiler is gcc 4.9.0, mingw x64:

#include <immintrin.h>
#include <stdint.h>   /* intptr_t */

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);

    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}

gcc -c -O3 -march=corei7 -mavx2 triad.c
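The listing below appears to be objdump output in Intel syntax. To reproduce a comparable listing from the object file, something along these lines should work (the exact tool and flags the answer used are an assumption on my part):

objdump -d -Mintel triad.o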

0000000000000000 <triad>:
   0:   44 89 c8                mov    eax,r9d
   3:   f7 d8                   neg    eax
   5:   48 98                   cdqe
   7:   48 85 c0                test   rax,rax
   a:   79 31                   jns    3d <triad+0x3d>
   c:   c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
  14:   4d 63 c9                movsxd r9,r9d
  17:   49 c1 e1 02             shl    r9,0x2
  1b:   4c 01 ca                add    rdx,r9
  1e:   4c 01 c9                add    rcx,r9
  21:   4d 01 c8                add    r8,r9

  24:   c5 f4 59 04 82          vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
  29:   c5 fc 58 04 81          vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
  2e:   c4 c1 7c 29 04 80       vmovaps YMMWORD PTR [r8+rax*4],ymm0
  34:   48 83 c0 08             add    rax,0x8
  38:   78 ea                   js     24 <triad+0x24>

  3a:   c5 f8 77                vzeroupper
  3d:   c3                      ret

Like your hand-written code, gcc uses 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5-instruction loop, but the C code is awkward, and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
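The answer does not show that awkward scale=1 C code. Purely as an illustration of one way it might be written (my own sketch, not ScottD's code), the byte offset can be computed explicitly so the index register already holds a byte count; whether a given GCC version really keeps this to a 5-instruction loop with scale=1 addressing is not guaranteed:

#include <immintrin.h>
#include <stdint.h>

void triad_scale1(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    __m256 k4 = _mm256_set1_ps(k);
    /* Point one-past-the-end of each array and walk a negative byte offset up to zero. */
    char *xe = (char *)(x + n);
    char *ye = (char *)(y + n);
    char *ze = (char *)(z + n);
    intptr_t i;
    for(i = -(intptr_t)n * (intptr_t)sizeof(float); i < 0; i += 32) {  /* 32 bytes = 8 floats */
        _mm256_store_ps((float *)(ze + i),
            _mm256_add_ps(_mm256_load_ps((float *)(xe + i)),
                          _mm256_mul_ps(k4, _mm256_load_ps((float *)(ye + i)))));
    }
}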

That concludes this article on getting GCC to produce loops without the cmp instruction; hopefully the recommended answer above is helpful.
