Question
I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.
#include <x86intrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}
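For reference, here is a minimal, hedged driver sketch (not part of the original question): _mm256_load_ps and _mm256_store_ps require 32-byte aligned pointers, and the loop assumes n is a multiple of 8.

/* Hedged driver sketch, not from the original post. */
#include <x86intrin.h>

int main(void) {
    const int n = 2048;                           /* fits in L1, as in the question */
    float *x = _mm_malloc(n * sizeof(float), 32); /* 32-byte aligned buffers */
    float *y = _mm_malloc(n * sizeof(float), 32);
    float *z = _mm_malloc(n * sizeof(float), 32);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    triad(x, y, z, n);                            /* z[i] = x[i] + 3.14159f*y[i] */
    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}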
This produces a main loop like this
20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add rax,0x20
33: cmp rax,rcx
36: jne 20
But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the end of the arrays, set rax to -sizeof(float)*n, and then test for zero. I am able to do this with my own assembly code like this:
.L2:
    vmulps  ymm1, ymm2, [rdi+rax]
    vaddps  ymm0, ymm1, [rsi+rax]
    vmovaps [rdx+rax], ymm0
    add     rax, 32
    jne     .L2
but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch or a way to reorder/change my code so that the cmp instruction is not produced by GCC.
I tried the following, but it still produces cmp. All variations I have tried still produce cmp.
void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}
I'm interested in maximizing instruction level parallelism (ILP) for these functions for arrays which fit in the L1 cache (in practice, for n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
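To make the unrolling trade-off concrete, here is a hedged sketch of what a 2x-unrolled intrinsics variant might look like; the author's unroll16 kernel is hand-written assembly unrolled 16 times, not this C code.

/* Hedged illustration only: 2x-unrolled triad loop. Assumes n is a multiple
   of 16 and 32-byte aligned pointers; not the author's unroll16 kernel. */
#include <x86intrin.h>

void triad_unroll2(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    for (int i = 0; i < n; i += 16) {   /* two 8-float vectors per iteration */
        __m256 a0 = _mm256_add_ps(_mm256_load_ps(&x[i]),
                                  _mm256_mul_ps(k4, _mm256_load_ps(&y[i])));
        __m256 a1 = _mm256_add_ps(_mm256_load_ps(&x[i + 8]),
                                  _mm256_mul_ps(k4, _mm256_load_ps(&y[i + 8])));
        _mm256_store_ps(&z[i], a0);
        _mm256_store_ps(&z[i + 8], a1);
    }
}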
Here is a table of results for a Core2 (pre Nehalem), an IvyBridge, and a Haswell system. Intrinsic is the result of using intrinsics, unroll1 is my assembly code without the cmp, and unroll16 is my assembly code unrolled 16 times. The percentages are percentages of peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA); a worked example of this formula follows the table.
                    SSE      AVX      FMA
intrinsic         71.3%    90.9%    53.6%
unroll1           97.0%    96.1%    63.5%
unroll16          98.6%    90.4%    93.6%
ScottD            96.5%
32B code align             95.5%
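As a hedged worked example of the peak formula above (the 3.6 GHz clock is an assumed illustrative value, not a frequency reported by the author):

/* Hedged arithmetic example of frequency*num_bytes_cycle.
   3.6 GHz is an assumed clock; 96 is the FMA (Haswell) bytes-per-cycle figure
   from the text. */
#include <stdio.h>

int main(void) {
    double freq_ghz = 3.6;                         /* assumed core clock */
    int bytes_per_cycle = 96;                      /* FMA figure from the question */
    double peak_gb_s = freq_ghz * bytes_per_cycle; /* GB/s of load+store traffic */
    printf("peak = %.1f GB/s; 93.6%% of peak = %.1f GB/s\n",
           peak_gb_s, 0.936 * peak_gb_s);
    return 0;
}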
For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use cmp. On AVX I get the best result without unrolling and without using cmp. It's interesting that on IB unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked this question. The source code to test this can be found in that question.
Based on ScottD's answer I now get almost 97% with intrinsics on my Core2 system (pre Nehalem, in 64-bit mode). I'm not sure why the cmp actually matters, since each iteration should take 2 clock cycles anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.
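For readers who want to experiment with the code-alignment effect mentioned above, one option (my suggestion; the author does not say which flags, if any, they used) is GCC's loop-alignment switch:

gcc -c -O3 -mavx -falign-loops=32 triad.c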
Answer
How about this? The compiler is gcc 4.9.0, mingw x64:
#include <stdint.h>
#include <x86intrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}
gcc -c -O3 -march=corei7 -mavx2 triad.c
0000000000000000 <triad>:
0: 44 89 c8 mov eax,r9d
3: f7 d8 neg eax
5: 48 98 cdqe
7: 48 85 c0 test rax,rax
a: 79 31 jns 3d <triad+0x3d>
c: c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
14: 4d 63 c9 movsxd r9,r9d
17: 49 c1 e1 02 shl r9,0x2
1b: 4c 01 ca add rdx,r9
1e: 4c 01 c9 add rcx,r9
21: 4d 01 c8 add r8,r9
24: c5 f4 59 04 82 vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
29: c5 fc 58 04 81 vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
2e: c4 c1 7c 29 04 80 vmovaps YMMWORD PTR [r8+rax*4],ymm0
34: 48 83 c0 08 add rax,0x8
38: 78 ea js 24 <triad+0x24>
3a: c5 f8 77 vzeroupper
3d: c3 ret
Like your hand-written code, gcc uses 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5-instruction loop, but the C code is awkward and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
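ScottD mentions that a scale=1 version is possible but awkward; below is a hedged sketch of what such byte-indexed C might look like. This is my illustration only, not necessarily the code ScottD tried.

/* Hedged sketch: byte-offset indexing that encourages a scale=1 addressing mode.
   Assumes 32-byte aligned pointers and n a multiple of 8; illustration only. */
#include <stdint.h>
#include <x86intrin.h>

void triad_scale1(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    char *xb = (char *)(x + n);   /* base pointers at the end of each array */
    char *yb = (char *)(y + n);
    char *zb = (char *)(z + n);
    for (intptr_t i = -(intptr_t)n * (intptr_t)sizeof(float); i < 0; i += 32) {
        __m256 t = _mm256_add_ps(_mm256_load_ps((const float *)(xb + i)),
                                 _mm256_mul_ps(k4, _mm256_load_ps((const float *)(yb + i))));
        _mm256_store_ps((float *)(zb + i), t);
    }
}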