This article looks at L1 memory bandwidth: a 50% drop in efficiency when using addresses that differ by 4096+64 bytes, and what causes it.

Question


I want to achieve the maximum bandwidth of the following operations with Intel processors.

for(int i=0; i<n; i++) z[i] = x[i] + y[i]; //n=2048

where x, y, and z are float arrays. I am doing this on Haswell, Ivy Bridge, and Westmere systems.
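
(With n=2048 each array is 8 KiB, so all three together occupy 24 KiB and fit in the 32 KiB L1 data cache of all of these processors, which is why this is an L1 bandwidth test.)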

I originally allocated the memory like this

char *a = (char*)_mm_malloc(sizeof(float)*n, 64);
char *b = (char*)_mm_malloc(sizeof(float)*n, 64);
char *c = (char*)_mm_malloc(sizeof(float)*n, 64);
float *x = (float*)a; float *y = (float*)b; float *z = (float*)c;

When I did this I got about 50% of the peak bandwidth I expected for each system.

The peak value is calculated as frequency * average bytes/clock_cycle. The average bytes/clock cycle for each system is:

Core2: two 16 byte reads and one 16 byte write per 2 clock cycles -> 24 bytes/clock cycle
SB/IB: two 32 byte reads and one 32 byte write per 2 clock cycles -> 48 bytes/clock cycle
Haswell: two 32 byte reads and one 32 byte write per clock cycle  -> 96 bytes/clock cycle
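
As a quick sanity check, multiplying these by the clock speeds hard-coded further down in sum.cpp (which reflect my particular test machines) gives the absolute peaks: Westmere 2.4 GHz * 24 B/clock = 57.6 GB/s, Ivy Bridge 3.6 GHz * 48 B/clock = 172.8 GB/s, Haswell 1.3 GHz * 96 B/clock = 124.8 GB/s.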

This means that e.g. on Haswell I only observe 48 bytes/clock cycle (it could be two reads in one clock cycle and one write the next clock cycle).

I printed out the address differences b-a and c-b; each is 8256 bytes. The value 8256 is 8192+64. So they are each larger than the array size (8192 bytes) by one cache line.

On a whim I tried allocating the memory like this.

const int k = 0;
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float)+k*64;
char *c = b+n*sizeof(float)+k*64;
float *x = (float*)a; float *y = (float*)b; float *z = (float*)c;
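
(For reference: 1<<18 = 256 KiB, comfortably more than the three 8 KiB arrays plus the cache-line gaps, and the 4096-byte alignment puts a at the start of a page, so the k*64 offsets directly control the low 12 address bits of b and c.)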

This nearly doubled my peak bandwidth so that I now get around 90% of the peak bandwidth. However, when I tried k=1 it dropped back to 50%. I have tried other values of k and found that e.g. k=2, k=33, k=65 only get 50% of the peak, but e.g. k=10, k=32, k=63 give the full speed. I don't understand this.

In Agner Fog's microarchitecture manual he says that there is a false dependency with memory addresses with the same set and offset:

It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes.

But that's exactly where I see the biggest benefit! When k=0 the memory addresses differ by exactly 2*4096 bytes. Agner also talks about cache bank conflicts. But Haswell and Westmere are not supposed to have these bank conflicts, so that should not explain what I am observing. What's going on!?

I understand that out-of-order (OoO) execution decides which address to read and write, so even if the arrays' memory addresses differ by exactly 4096 bytes that does not necessarily mean the processor reads e.g. &x[0] and writes &z[0] at the same time. But then why would being off by a single cache line cause it to choke?

Edit: Based on Evgeny Kluev's answer I now believe this is what Agner Fog calls a "bogus store forwarding stall". In his manual under Pentium Pro, II and III he writes:

Interestingly, you can get a bogus store forwarding stall when writing and reading completely different addresses if they happen to have the same set-value in different cache banks:

; Example 5.28. Bogus store-to-load forwarding stall
mov byte ptr [esi], al
mov ebx, dword ptr [esi+4092]   ; No stall
mov ecx, dword ptr [esi+4096]   ; Bogus stall
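
A rough C analogue of that pattern, just to make the addresses concrete (my own sketch, not from Agner's manual; a real compiler may reorder or remove these accesses, so treat it as illustration only):

#include <cstring>

// buf is 4 KiB-aligned, so buf+4096 has the same low 12 address bits as buf+0.
alignas(4096) static char buf[8192];

void bogus_stall_sketch() {
    buf[0] = 1;                          // store one byte at offset 0 (like "mov byte ptr [esi], al")
    int no_stall, bogus;
    memcpy(&no_stall, buf + 4092, 4);    // load from offset 4092: different offset within the 4K block, no stall
    memcpy(&bogus,    buf + 4096, 4);    // load 4096 bytes past the store: same 4K offset, bogus stall
    (void)no_stall; (void)bogus;
}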

Edit: Here is a table of the efficiencies on each system for k=0 and k=1.

               k=0      k=1
Westmere:      99%      66%
Ivy Bridge:    98%      44%
Haswell:       90%      49%

I think I can explain these numbers if I assume that for k=1 writes and reads cannot happen in the same clock cycle.

       cycle     Westmere          Ivy Bridge           Haswell
           1     read  16          read  16 read  16    read  32 read 32
           2     write 16          read  16 read  16    write 32
           3                       write 16
           4                       write 16

k=1/k=0 peak    16/24=66%          24/48=50%            48/96=50%

This theory works out pretty well. Ivy Bridge is a bit lower than I would expect, but Ivy Bridge suffers from cache bank conflicts where the others don't, so that may be another effect to consider.

Below is working code to test this yourself. On a system without AVX compile with g++ -O3 sum.cpp, otherwise compile with g++ -O3 -mavx sum.cpp. Try varying the value of k.
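
For example, on the Haswell box that would be g++ -O3 -mavx sum.cpp -o sum && ./sum (on systems with an older glibc you may also need -lrt to link clock_gettime).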

//sum.cpp
#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

void sum(float * __restrict x, float * __restrict y, float * __restrict z, const int n) {
    #if defined(__GNUC__)
    x = (float*)__builtin_assume_aligned (x, 64);
    y = (float*)__builtin_assume_aligned (y, 64);
    z = (float*)__builtin_assume_aligned (z, 64);
    #endif
    for(int i=0; i<n; i++) {
        z[i] = x[i] + y[i];
    }
}

#if (defined(__AVX__))
void sum_avx(float *x, float *y, float *z, const int n) {
    float *x1 = x;
    float *y1 = y;
    float *z1 = z;
    for(int i=0; i<n/64; i++) { //unroll eight times
        _mm256_store_ps(z1+64*i+  0,_mm256_add_ps(_mm256_load_ps(x1+64*i+ 0), _mm256_load_ps(y1+64*i+  0)));
        _mm256_store_ps(z1+64*i+  8,_mm256_add_ps(_mm256_load_ps(x1+64*i+ 8), _mm256_load_ps(y1+64*i+  8)));
        _mm256_store_ps(z1+64*i+ 16,_mm256_add_ps(_mm256_load_ps(x1+64*i+16), _mm256_load_ps(y1+64*i+ 16)));
        _mm256_store_ps(z1+64*i+ 24,_mm256_add_ps(_mm256_load_ps(x1+64*i+24), _mm256_load_ps(y1+64*i+ 24)));
        _mm256_store_ps(z1+64*i+ 32,_mm256_add_ps(_mm256_load_ps(x1+64*i+32), _mm256_load_ps(y1+64*i+ 32)));
        _mm256_store_ps(z1+64*i+ 40,_mm256_add_ps(_mm256_load_ps(x1+64*i+40), _mm256_load_ps(y1+64*i+ 40)));
        _mm256_store_ps(z1+64*i+ 48,_mm256_add_ps(_mm256_load_ps(x1+64*i+48), _mm256_load_ps(y1+64*i+ 48)));
        _mm256_store_ps(z1+64*i+ 56,_mm256_add_ps(_mm256_load_ps(x1+64*i+56), _mm256_load_ps(y1+64*i+ 56)));
    }
}
#else
void sum_sse(float *x, float *y, float *z, const int n) {
    float *x1 = x;
    float *y1 = y;
    float *z1 = z;
    for(int i=0; i<n/32; i++) { //unroll eight times
        _mm_store_ps(z1+32*i+  0,_mm_add_ps(_mm_load_ps(x1+32*i+ 0), _mm_load_ps(y1+32*i+  0)));
        _mm_store_ps(z1+32*i+  4,_mm_add_ps(_mm_load_ps(x1+32*i+ 4), _mm_load_ps(y1+32*i+  4)));
        _mm_store_ps(z1+32*i+  8,_mm_add_ps(_mm_load_ps(x1+32*i+ 8), _mm_load_ps(y1+32*i+  8)));
        _mm_store_ps(z1+32*i+ 12,_mm_add_ps(_mm_load_ps(x1+32*i+12), _mm_load_ps(y1+32*i+ 12)));
        _mm_store_ps(z1+32*i+ 16,_mm_add_ps(_mm_load_ps(x1+32*i+16), _mm_load_ps(y1+32*i+ 16)));
        _mm_store_ps(z1+32*i+ 20,_mm_add_ps(_mm_load_ps(x1+32*i+20), _mm_load_ps(y1+32*i+ 20)));
        _mm_store_ps(z1+32*i+ 24,_mm_add_ps(_mm_load_ps(x1+32*i+24), _mm_load_ps(y1+32*i+ 24)));
        _mm_store_ps(z1+32*i+ 28,_mm_add_ps(_mm_load_ps(x1+32*i+28), _mm_load_ps(y1+32*i+ 28)));
    }
}
#endif

int main () {
    const int n = 2048;
    const int k = 0;
    float *z2 = (float*)_mm_malloc(sizeof(float)*n, 64);

    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float)+k*64;
    char *c = b+n*sizeof(float)+k*64;

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;
    printf("x %p, y %p, z %p, y-x %d, z-y %d
", a, b, c, b-a, c-b);

    for(int i=0; i<n; i++) {
        x[i] = (1.0f*i+1.0f);
        y[i] = (1.0f*i+1.0f);
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;

    sum(x,y,z,n);
    #if (defined(__AVX__))
    sum_avx(x,y,z2,n);
    #else
    sum_sse(x,y,z2,n);
    #endif
    printf("error: %d
", memcmp(z,z2,sizeof(float)*n));

    while(1) {
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__AVX__))
        for(int r=0; r<repeat; r++) sum_avx(x,y,z,n);
        #else
        for(int r=0; r<repeat; r++) sum_sse(x,y,z,n);
        #endif
        clock_gettime(TIMER_TYPE, &time2);

        double dtime = time_diff(time1,time2);
        double peak = 1.3*96; //haswell @1.3GHz
        //double peak = 3.6*48; //Ivy Bridge @ 3.6Ghz
        //double peak = 2.4*24; // Westmere @ 2.4GHz
        double rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("dtime %f, %f GB/s, peak, %f, efficiency %f%%
", dtime, rate, peak, 100*rate/peak);
    }
}
Solution

I think the gap between a and b does not really matter. After leaving only one gap between b and c I've got the following results on Haswell:

k   %
-----
1  48
2  48
3  48
4  48
5  46
6  53
7  59
8  67
9  73
10 81
11 85
12 87
13 87
...
0  86

Since Haswell is known to be free of bank conflicts, the only remaining explanation is a false dependence between memory addresses (and you've found the proper place in Agner Fog's microarchitecture manual explaining exactly this problem). The difference between a bank conflict and false sharing is that a bank conflict prevents accessing the same bank twice during the same clock cycle, while false sharing prevents reading from some offset in a 4K piece of memory just after you've written something to the same offset (and not only during the same clock cycle, but also for several clock cycles after the write).
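
Put another way, the false-sharing check only involves the offset within a 4 KiB block, i.e. the low 12 address bits. A small helper illustrating the condition (my own sketch, not code from the manual):

#include <cstdint>

// True if a load from load_addr can be falsely matched against an earlier store
// to store_addr: both addresses have the same offset within a 4 KiB block
// (same bits 0..11), even though they point to different memory.
static bool may_false_share_4k(const void *store_addr, const void *load_addr) {
    return ((uintptr_t)store_addr & 0xFFF) == ((uintptr_t)load_addr & 0xFFF);
}

For instance, with k=1 in the question's layout, z - y = 8192 + 64 bytes, so &z[i] and &y[i+16] differ by exactly 8192 bytes and satisfy this condition: the read of y[i+16] comes only a few iterations after the write of z[i], which is exactly the recently-overwritten-offset case described above.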

Since your code (for k=0) writes to an offset only after doing two reads from that same offset, and would not read from it again for a very long time, this case should be considered the "best" one, so I placed k=0 at the end of the table. For k=1 you always read from an offset that was very recently overwritten, which means false sharing and therefore performance degradation. As k grows, the time between the write and the read increases and the CPU core has more chances to pass the written data through the whole memory hierarchy (which means two address translations for read and write, updating cache data and tags, getting data from the cache, data synchronization between cores, and probably much more). k=12, or 24 clocks (on my CPU), is enough for every written piece of data to be ready for subsequent read operations, so starting with this value performance gets back to normal. That does not look very different from the 20+ clocks on AMD (as mentioned by @Mysticial).
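
If you want to reproduce the table above without editing the constant and recompiling each time, a minimal sketch of the sweep could look like this (it reuses sum() and time_diff() from the question's sum.cpp, so the includes there are assumed, and it keeps only the single gap between b and c that I tested with; I have not run it in exactly this form):

// Sketch: measure bandwidth for each gap of k cache lines between b and c.
// Assumes sum() and time_diff() from the question are in scope.
void sweep_k(const int n, const int repeat) {
    for (int k = 0; k <= 16; k++) {
        char *mem = (char*)_mm_malloc(1<<18, 4096);
        char *a = mem;
        char *b = a + n*sizeof(float);            // no gap between a and b
        char *c = b + n*sizeof(float) + k*64;     // k cache lines between b and c
        float *x = (float*)a, *y = (float*)b, *z = (float*)c;
        for (int i = 0; i < n; i++) { x[i] = y[i] = 1.0f*i; z[i] = 0; }

        timespec t1, t2;
        clock_gettime(CLOCK_REALTIME, &t1);
        for (int r = 0; r < repeat; r++) sum(x, y, z, n);
        clock_gettime(CLOCK_REALTIME, &t2);

        double dtime = time_diff(t1, t2);
        double rate  = 3.0*1E-9*sizeof(float)*n*repeat/dtime;  // two reads + one write per float
        printf("k=%2d  %6.2f GB/s\n", k, rate);
        _mm_free(mem);
    }
}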

