本文介绍了获取有关的Haswell峰值带宽的L1缓存：只获得62％的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我试图获得L1缓存为英特尔处理器下面的函数

全带宽

 浮动黑社会（浮点* X，浮动* Y，浮动* Z，const int的N）{
    浮K = 3.14159f;
    的for（int i = 0; I＆LT; N;我++）{
        Z [i] = X [I] + K * Y [I]
    }
}

这是从黑社会功能。

我得到的SandyBridge / IvyBridge的处理器的峰值大约95％使用此功能（使用汇编与NASM）。但是，使用的Haswell我，除非我展开循环达到峰值的62％。如果我解开16次，我得到了92％。我不明白这一点。

我决定写我的功能使用NASM汇编。在装配主循环看起来像这样。

  .L2：
    vmovaps ymm1，[RDI + RAX]
    vfmadd231ps ymm1，ymm2，[RSI + RAX]
    vmovaps [RDX + RAX]，ymm1
    添加RAX，32
    JNE .L2

原来在的例子12.7-12.11他做几乎同样的事情（但 Y [i] = Y [I] + K * X [I] ）。我设法复制他的code或多或少对我自己的（实际上他有一个小的时候，他在广播FMA3示例bug）。他给了指令长度计数，融合OPS，执行端口为除FMA4和FMA3每个处理器表。我试图让这个表我自己FMA3。

 港口
             大小μops稠0 1 2 3 4 5 6 7
vmovaps 5 1½½
vfmadd231ps 6 1½½½½
vmovaps 5 1 1 1
加4½½
JNE 2½½
-------------------------------------------------- ------------
共有22½4½1 1 1 0 1 1

尺寸指的是在字节指令长度。究其原因，添加和 JNE 指令已经半个μop是他们得到融合成一个宏指令（不被混淆仍然使用多个端口μop融合），只需要6端口和一个μop。的。为了与瓦格纳雾的表一致的，因为我认为它更有意义说哪个可以去不同的端口的指令同样进入每一个1/2的时候，给我分配了1/2的端口 vmovaps 和 vmadd231ps 可以去。

根据此表和事实，即所有2处理器可以做四μops看来这个循环应该可以在每个时钟周期，但我没有设法得到它每个时钟周期。 . In order to be consistent with Agner Fog's tables and since I think it makes more sense to say an instruction which can go to different ports equally goes to each one 1/2 of the time, I assigned 1/2 for the ports vmovaps and vmadd231ps can go to.

Based on this table and the fact that all Core2 processors can do four μops every clock cycle it appears this loop should be possible every clock cycle but I have not managed to obtain it. Can somebody please explain to me why I can't get close to the peak bandwidth for this function on Haswell without unrolling? Is this possible without unrolling and if so how can it be done? Let me be clear that I'm really trying to maximize the ILP for this function (I don't only want maximum bandwidth) so that's the reason I don't want to unroll.

Edit:Here is an update since Iwillnotexist Idonotexist showed using IACA that the stores never use port 7. I managed to break the 66% barrier without unrolling and do this in one clock cycle every iteration without unrolling(theoretically). Let's first address the store problem.

Stephen Canon mentioned in at comment that the Address Generation Unit (AGU) in port 7 can only handle simple operations such as [base + offset] and not [base + index]. In the Intel optimization reference manual the only thing I found was a comment on port7 which says "Simple_AGU" with no definition of what simple means. But then Iwillnotexist Idonotexist found in the comments of IACA that this problem was already mentioned six months ago in which an employee at Intel wrote on 03/11/2014:

Stephen Canon suggests "using the store address as the offset for the load operands." I have tried this like this

vmovaps         ymm1, [rdi + r9 + 32*i]
vfmadd231ps     ymm1, ymm2, [rsi + r9 + 32*i]
vmovaps         [r9 + 32*i], ymm1
add             r9, 32*unroll
cmp             r9, rcx
jne             .L2

This indeed causes the store to use port7. However, it has another problem which is that the the vmadd231ps does not fuse with the load which you can see from IACA. It also needs additionally the cmp instruction which my original function did not. So the store uses one less micro-op but the cmp (or rather then add since the cmp macro fuses with the jne) needs one more. IACA reports a block throughput of 1.5. In practice this only get about 57% of the peak.

But I found a way to get the vmadd231ps instruction to fuse with the load as well. This can only be done using static arrays with addressing [absolute 32-bit address + index] like this. Evgeny Kluev original suggested this.

vmovaps         ymm1, [src1_end + rax]
vfmadd231ps     ymm1, ymm2, [src2_end + rax]
vmovaps         [dst_end + rax], ymm1
add             rax, 32
jl              .L2

Where src1_end, src2_end, and dst_end are the end addresses of static arrays.

This reproduces the table in my question with four fused micro-ops that I expected. If you put this into IACA it reports a block throughput of 1.0. In theory this should do as well as the SSE and AVX versions. In practice it gets about 72% of the peak. That breaks the 66% barrier but it's still a long ways from the 92% I get unrolling 16 times. So on Haswell the only option to get close to the peak is to unroll. This is not necessary on Core2 through Ivy Bridge but it is on Haswell.

End_edit:

Here is the C/C++ Linux code to test this. The NASM code is posted after the C/C++ code. The only thing you have to change is the frequency number. In the line double frequency = 1.3; replace 1.3 with whatever the operating (not nominal) frequency of your processors is (which in case for a i5-4250U with turbo disabled in the BIOS is 1.3 GHz).

Compile with

nasm -f elf64 triad_sse_asm.asm
nasm -f elf64 triad_avx_asm.asm
nasm -f elf64 triad_fma_asm.asm
g++ -m64 -lrt -O3 -mfma  tests.cpp triad_fma_asm.o -o tests_fma
g++ -m64 -lrt -O3 -mavx  tests.cpp triad_avx_asm.o -o tests_avx
g++ -m64 -lrt -O3 -msse2 tests.cpp triad_sse_asm.o -o tests_sse

The C/C++ code

#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

extern "C" float triad_sse_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_sse_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);

#if (defined(__FMA__))
float triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_fmadd_ps(k4, _mm256_load_ps(&y[i]), _mm256_load_ps(&x[i])));
        }
    }
}
#elif (defined(__AVX__))
float triad_avx_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}
#else
float triad_sse_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m128 k4 = _mm_set1_ps(k);
        for(i=0; i<n; i+=4) {
            _mm_store_ps(&z[i], _mm_add_ps(_mm_load_ps(&x[i]), _mm_mul_ps(k4, _mm_load_ps(&y[i]))));
        }
    }
}
#endif

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

int main () {
    int bytes_per_cycle = 0;
    double frequency = 1.3;  //Haswell
    //double frequency = 3.6;  //IB
    //double frequency = 2.66;  //Core2
    #if (defined(__FMA__))
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;

    const int n =2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;
    #if (defined(__FMA__))
    triad_fma_repeat(x,y,z2,n,repeat);
    #elif (defined(__AVX__))
    triad_avx_repeat(x,y,z2,n,repeat);
    #else
    triad_sse_repeat(x,y,z2,n,repeat);
    #endif

    while(1) {
        double dtime, rate;

        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll1     rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_repeat(x,y,z,n,repeat);
        #else
        triad_sse_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("intrinsic   rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat_unroll16(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat_unroll16(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat_unroll16(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll16    rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
    }
}

The NASM code using the System V AMD64 ABI.

triad_fma_asm.asm:

global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159
;align 16
section .text
    triad_fma_asm_repeat:
    shl             rcx, 2
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_fma_asm_repeat_unroll16
section .text
    triad_fma_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 32
    %assign i 0
    %rep    unroll
        vmovaps         ymm1, [r9 + 32*i]
        vfmadd231ps     ymm1, ymm2, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_ava_asm.asm:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat2:
    shl             rcx, 2
    vbroadcastss    ymm2, [rel pi]

align 16
.L1:
    xor             rax, rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             eax, 32
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat_unroll16
align 16
section .text
    triad_avx_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]
align 16
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
align 16
.L2:
    %assign unroll 16
    %assign i 0
    %rep    unroll
        vmulps          ymm1, ymm2, [r9 + 32*i]
        vaddps          ymm1, ymm1, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_sse_asm.asm:

global triad_sse_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat:
    shl             rcx, 2
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
    ;neg                rcx
align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             rax, 16
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

global triad_sse_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat2:
    shl             rcx, 2
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
align 16
.L1:
    xor             rax, rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             eax, 16
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret



global triad_sse_asm_repeat_unroll16
section .text
    triad_sse_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 8
    %assign i 0
    %rep    unroll
        movaps          xmm1, [r9 + 16*i]
        mulps           xmm1, xmm2,
        addps           xmm1, [r10 + 16*i]
        movaps          [r11 + 16*i], xmm1
    %assign i i+1
    %endrep
    add             r9, 16*unroll
    add             r10, 16*unroll
    add             r11, 16*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

解决方案

`IACA Analysis`

Using IACA (the Intel Architecture Code Analyzer) reveals that macro-op fusion is indeed occurring, and that it is not the problem. It is Mysticial who is correct: The problem is that the store isn't using Port 7 at all.

IACA reports the following:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

In particular, the reported block throughput in cycles (1.5) jives very well with an efficiency of 66%.

A post on IACA's own website about this very phenomenon on Tue, 03/11/2014 - 12:39 was met by this reply by an Intel employee on Tue, 03/11/2014 - 23:20:

This firmly settles why Port 7 wasn't being used.

Now, contrast the above with a 32x unrolled loop (it turns out unroll16 shoudl actually be called unroll32):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 16.0   0.0  | 16.0 | 32.0   32.0 | 32.0   32.0 | 32.0 | 2.0  | 2.0  | 32.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r9, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | add r10, 0x400
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r11, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp r9, rcx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164

We see here micro-fusion and correct scheduling of the store to Port 7.

`Manual Analysis (see edit above)`

I can now answer the second of your questions: Is this possible without unrolling and if so how can it be done?. The answer is no.

I padded the arrays x, y and z to the left and right with plenty of buffer for the below experiment, and changed the inner loop to the following:

.L2:
vmovaps         ymm1, [rdi+rax] ; 1L
vmovaps         ymm0, [rsi+rax] ; 2L
vmovaps         [rdx+rax], ymm2 ; S1
add             rax, 32         ; ADD
jne             .L2             ; JMP

This intentionally does not use FMA (only loads and stores) and all load/store instructions have no dependencies, since there should therefore be no hazards whatever preventing their issue into any execution ports.

I then tested every single permutation of the first and second loads (1L and 2L), the store (S1) and the add (A) while leaving the conditional jump (J) at the end, and for each of these I tested every possible combination of offsets of x, y and z by 0 or -32 bytes (to correct for the fact that reordering the add rax, 32 before one of the r+r indexes would cause the load or store to target the wrong address). The loop was aligned to 32 bytes. The tests were run on a 2.4GHz i7-4700MQ with TurboBoost disabled by means of echo '0' > /sys/devices/system/cpu/cpufreq/boost under Linux, and using 2.4 for the frequency constant. Here are the efficiency results (maximum of 24):

Cases: 0           1           2           3           4           5           6           7
       L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S
       -0  -0  -0  -0  -0  -32 -0  -32 -0  -0  -32 -32 -32 -0  -0  -32 -0  -32 -32 -32 -0  -32 -32 -32
       ________________________________________________________________________________________________
12SAJ: 65.34%      65.34%      49.63%      65.07%      49.70%      65.05%      49.22%      65.07%
12ASJ: 48.59%      64.48%      48.74%      49.69%      48.75%      49.69%      48.99%      48.60%
1A2SJ: 49.69%      64.77%      48.67%      64.06%      49.69%      49.69%      48.94%      49.69%
1AS2J: 48.61%      64.66%      48.73%      49.71%      48.77%      49.69%      49.05%      48.74%
1S2AJ: 49.66%      65.13%      49.49%      49.66%      48.96%      64.82%      49.02%      49.66%
1SA2J: 64.44%      64.69%      49.69%      64.34%      49.69%      64.41%      48.75%      64.14%
21SAJ: 65.33%*     65.34%      49.70%      65.06%      49.62%      65.07%      49.22%      65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91%      65.19%      49.04%      49.72%      49.12%      49.63%      49.21%      48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69%      64.93%      49.70%      64.66%      49.69%      64.27%      48.71%      64.56%
S12AJ: 48.90%      65.20%      49.12%      49.63%      49.03%      49.70%      49.21%*     48.94%
S1A2J: 49.69%      64.74%      48.65%      64.48%      49.43%      49.69%      48.66%      49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62%      64.45%      49.69%      64.57%      49.69%      64.45%      48.58%      63.99%
A1S2J: 49.72%      64.69%      49.72%      49.72%      48.67%      64.46%      48.95%      49.72%
AS21J: Hypothetically =AS21J
AS12J: 48.71%      64.53%      48.76%      49.69%      48.76%      49.74%      48.93%      48.69%

We can notice a few things from the table:

Several plateaux of results, but two main ones only: Just under 50% and around 65%.
L1 and L2 can permute freely between each other without affecting the result.
Offsetting the accesses by -32 bytes can change efficiency.
The patterns we are interested in (Load 1, Load 2, Store 1 and Jump with the Add anywhere around them and the -32 offsets properly applied) are all the same, and all in the higher plateau:12SAJ Case 0 (No offsets applied), with efficiency 65.34% (the highest) 12ASJ Case 1 (S-32), with efficiency 64.48% 1A2SJ Case 3 (2L-32, S-32), with efficiency 64.06% A12SJ Case 7 (1L-32, 2L-32, S-32), with efficiency 63.99%
There always exists at least one "case" for every permutation that allows execution at the higher plateau of efficiency. In particular, Case 1 (where S-32) seems to guarantee this.
Cases 2, 4 and 6 guarantee execution at the lower plateau. They have in common that either or both of the loads are offset by -32 while the store isn't.
For cases 0, 3, 5 and 7, it depends on the permutation.

Whence we may draw at least a few conclusions:

Execution ports 2 and 3 really don't care which load address they generate and load from.
Macro-op fusion of the add and jmp appears unimpacted by any permutation of the instructions (in particular under Case 1 offsetting), leading me to believe that @Evgeny Kluev's conclusion is incorrect: The distance of the add from the jne does not appear to impact their fusion. I'm reasonably certain now that the Haswell ROB handles this correctly.What Evgeny was seeing (Going from 12SAJ with efficiency 65% to the others with efficiency 49% within Case 0) was an effect due solely to the value of the addresses loaded and stored from, and not due to an inability of the core to macro-fuse the add and branch. Further, macro-op fusion must be occurring at least some of the time, since the average loop time is 1.5 CC. If macro-op fusion did not occur this would be 2CC minimum.
Having tested all valid and invalid permutations of instructions within the not-unrolled loop, we've seen nothing higher than 65.34%. This answers empirically with a "no" the question of whether it is possible to use the full bandwidth without unrolling.

I will hypothesize several possible explanations:

We're seeing some wierd perversion due to the value of the addresses relative to each other.If so then there would exist a set of offsets of x, y and z that would allow maximum throughput. Quick random tests on my part seem not to support this.
We're seeing the loop run in one-two-step mode; The loop iterations alternate running in one clock cycle, then two.
- This could be macro-op fusion being affected by the decoders. From Agner Fog:
  Fuseable arithmetic/logic instructions cannot be decoded in the last of the four decoders on Sandy Bridge and Ivy Bridge processors. I have not tested whether this also applies to the Haswell.
- Alternately, every other clock cycle an instruction is issued to the "wrong" port, blocking the next iteration for one extra clock cycle. Such a situation would be self-correcting in the next clock cycle but would remain oscillatory.If somebody has access to the Intel performance counters, he should look at the events UOPS_EXECUTED_PORT.PORT_[0-7]. If oscillation is not occuring, all ports that are used will be pegged equally during the relevant stretch of time; Else if oscillation is occuring, there will be a 50% split. Especially important is to look at the ports Mystical pointed out (0, 1, 6 and 7).

And here's what I think is not happening:

I don't believe that the fused arithmetic+branch uop is blocking execution by going to port 0, since predicted-taken branches are sent exclusively to port 6 (see Agner Fog's Instruction Tables under Haswell -> Control transfer instructions). After a few iterations of the loop above, the branch predictor will learn that this branch is a loop and always predict as taken.

I believe this is a problem that will be solved with Intel's performance counters.

                        这篇关于获取有关的Haswell峰值带宽的L1缓存：只获得62％的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！

plateau

获取有关的Haswell峰值带宽的L1缓存：只获得62％

问题描述

IACA Analysis

Manual Analysis (see edit above)

`IACA Analysis`

`Manual Analysis (see edit above)`