How to produce the CPU cache effect in C and Java?

Problem Description


In Ulrich Drepper's paper What every programmer should know about memory, the 3rd part, CPU Caches, he shows a graph of the relationship between "working set" size and the CPU cycles consumed per operation (in this case, sequential reading). There are two jumps in the graph which indicate the sizes of the L1 and L2 caches. I wrote my own program to reproduce the effect in C. It simply reads an int[] array sequentially from head to tail, and I've tried different sizes of the array (from 1KB to 1MB). I plotted the data into a graph, and there is no jump; the graph is a straight line.

My questions are:


  1. Is there something wrong with my method? What is the correct way to produce the CPU cache effect (to see the jumps)?

  2. I was thinking that if it is sequential reading, it should work like this:
    Reading the first element is a cache miss, and then within the cache line size (64 bytes) the following accesses will be hits. With the help of prefetching, the latency of reading the next cache line will be hidden. Data will be read continuously into the L1 cache; even when the working set size exceeds the L1 cache size, the least recently used lines will be evicted and prefetching will continue. So most of the cache misses will be hidden, and the time spent fetching data from L2 will be hidden behind the reading activity, meaning they run at the same time. The associativity (8-way in my case) will hide the latency of reading data from L2. So the behavior of my program should be correct; am I missing something?

  3. Is it possible to get the same effect in Java?


By the way, I am doing this on Linux.

这是我的code:

Thanks to Stephen C's suggestion, here is some additional information. This is my code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int *arrayInt;

void initInt(long len) {
    arrayInt = (int *)malloc(len * sizeof(int));
    memset(arrayInt, 0, len * sizeof(int));
}

long sreadInt(long len) {
    long i;
    int sum = 0;
    struct timespec tsStart, tsEnd;

    initInt(len);

    clock_gettime(CLOCK_REALTIME, &tsStart);
    for(i = 0; i < len; i++) {
        sum += arrayInt[i];
    }
    clock_gettime(CLOCK_REALTIME, &tsEnd);
    free(arrayInt);

    // use sum so the compiler cannot delete the loop
    printf("%d\n", sum);

    // include tv_sec: tv_nsec alone goes wrong across second boundaries
    return ((tsEnd.tv_sec - tsStart.tv_sec) * 1000000000L
            + (tsEnd.tv_nsec - tsStart.tv_nsec)) / len;
}


In the main() function, I've tried array sizes from 1KB to 100MB; still the same, the average time per element is 2 nanoseconds. I think this time is the access time of L1d.

My cache sizes:

L1d == 32K

L2 == 256K

L3 == 6144K


I've changed my code to use a linked list.

#define NPAD 1  // NPAD can be changed; in my case I set it to 1

// element type
struct l {
    struct l *n;
    long int pad[NPAD];
};

struct l *array;
long globalSum;

// for init the array
void init(long len) {
    long i, j;

    struct l *ptr;

    array = (struct l*)malloc(sizeof(struct l));
    ptr = array;
    for(j = 0; j < NPAD; j++) {
        ptr->pad[j] = j;
    }
    ptr->n = NULL;

    for(i = 1; i < len; i++) {
        ptr->n = (struct l*)malloc(sizeof(struct l));
        ptr = ptr->n;
        for(j = 0; j < NPAD; j++) {
            ptr->pad[j] = i + j;
        }
        ptr->n = NULL;
    }

}

// for free the array when operation is done
void release() {
    struct l *ptr = array;
    struct l *tmp = NULL;
    while(ptr) {
        tmp = ptr;
        ptr = ptr->n;
        free(tmp);
    }
}

double sread(long len) {
    int i;
    long sum = 0;

    struct l *ptr;
    struct timespec tsStart, tsEnd;


    init(len);

    ptr = array;

    clock_gettime(CLOCK_REALTIME, &tsStart);
    while(ptr) {
        for(i = 0; i < NPAD; i++) {
            sum += ptr->pad[i];
        }
        ptr = ptr->n;
    }
    clock_gettime(CLOCK_REALTIME, &tsEnd);

    release();

    globalSum += sum;

    // include tv_sec: tv_nsec alone goes wrong across second boundaries
    return (double)((tsEnd.tv_sec - tsStart.tv_sec) * 1000000000L
                    + (tsEnd.tv_nsec - tsStart.tv_nsec)) / (double)len;
}


At last, I printf out globalSum in order to prevent the compiler from optimizing the loop away. As you can see, it is still a sequential read. I've even tried array sizes up to 500MB, and the average time per element is approximately 4 nanoseconds (maybe because it has to access both the data 'pad' and the pointer 'n': two accesses), the same as with a 1KB array. So I think it is because cache optimizations like prefetching hide the latency very well; am I right? I will try random access and post the result later.


I've tried random access to the linked list, and this is the result:


The first red line is my L1 cache size; the second is L2. So we can see a little jump there. And sometimes the latency is still hidden well.

Recommended Answer


This answer isn't an answer, but more of a set of notes.


First, the CPU tends to operate on cache lines, not on individual bytes/words/dwords. This means that if you sequentially read/write an array of integers, then the first access to a cache line may cause a cache miss but subsequent accesses to different integers in that same cache line won't. For 64-byte cache lines and 4-byte integers, this means that you'd only get a cache miss once for every 16 accesses, which will dilute the results.
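To make the misses visible despite this dilution, one option is to touch only one integer per cache line. A minimal sketch, assuming 64-byte lines (on Linux the real value can be checked under /sys/devices/system/cpu/cpu0/cache/); the function name is illustrative:

```c
#define CACHE_LINE 64                                 /* bytes; typical on x86 */
#define INTS_PER_LINE (CACHE_LINE / (int)sizeof(int)) /* 16 for 4-byte ints */

/* Read one int per cache line: every access lands on a fresh line, so
   ideally every access is a miss instead of 1 miss per 16 reads. */
long read_one_per_line(const int *a, long len) {
    long sum = 0;
    for (long i = 0; i < len; i += INTS_PER_LINE)
        sum += a[i];
    return sum;
}
```

With sequential strides the hardware prefetcher still helps, so this alone removes the dilution but not the latency hiding.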


Second, the CPU has a "hardware pre-fetcher". If it detects that cache lines are being read sequentially, the hardware pre-fetcher will automatically pre-fetch cache lines it predicts will be needed next (in an attempt to fetch them into cache before they're needed).
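One way to take the pre-fetcher out of the picture is to visit the elements in a randomly shuffled order, so there is no sequential pattern to detect. A sketch using a Fisher-Yates shuffle (function names are mine); note that the index array itself adds extra memory traffic, which is one reason dependent pointer chasing is usually preferred for latency measurements:

```c
#include <stdlib.h>

/* Fill idx with a uniformly random permutation of 0..n-1. */
void shuffle_indices(long *idx, long n) {
    for (long i = 0; i < n; i++)
        idx[i] = i;
    for (long i = n - 1; i > 0; i--) {
        long j = rand() % (i + 1);        /* Fisher-Yates shuffle */
        long t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}

/* Visit the array in shuffled order: no sequential pattern for the
   hardware pre-fetcher to lock on to. */
long read_shuffled(const int *a, const long *idx, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[idx[i]];
    return sum;
}
```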


Third, the CPU does other things ("out of order execution") to hide fetch costs. The time difference (between cache hit and cache miss) that you can measure is the time that the CPU couldn't hide and not the total cost of the fetch.


These three things combined mean that, for sequentially reading an array of integers, it's likely that the CPU pre-fetches the next cache line while you're doing 16 reads from the previous cache line, so any cache miss costs won't be noticeable and may be entirely hidden. To prevent this, you'd want to "randomly" access each cache line once, to maximise the performance difference measured between "working set fits in cache/s" and "working set doesn't fit in cache/s".
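The usual way to implement "randomly access each cache line once" is pointer chasing over a single random cycle, so every load depends on the result of the previous one and neither out-of-order execution nor the pre-fetcher can hide the miss latency. A sketch using Sattolo's algorithm (which guarantees one cycle covering all elements); the function names are illustrative:

```c
#include <stdlib.h>

/* Link every slot into one random cycle (Sattolo's algorithm): a chase
   starting anywhere visits each slot exactly once before returning. */
void build_cycle(long *next, long n) {
    for (long i = 0; i < n; i++)
        next[i] = i;
    for (long i = n - 1; i > 0; i--) {
        long j = rand() % i;              /* j < i: forces a single cycle */
        long t = next[i]; next[i] = next[j]; next[j] = t;
    }
}

/* Follow the chain for `steps` hops; each load depends on the previous
   one, so the loads serialize and expose the true memory latency. */
long chase(const long *next, long steps) {
    long p = 0;
    for (long i = 0; i < steps; i++)
        p = next[p];
    return p;
}
```

Timing `chase` over arrays of growing size, divided by the number of hops, is essentially how Drepper's latency graphs are produced.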


Finally, there are other factors that may influence measurements. For example, for an OS that uses paging (e.g. Linux and almost all other modern OSs) there's a whole layer of caching above all this (TLBs/Translation Look-aside Buffers), and TLB misses start occurring once the working set gets beyond a certain size; this should be visible as a fourth "step" in the graph. There's also interference from the kernel (IRQs, page faults, task switches, multiple CPUs, etc.), which might be visible as random static/error in the graph (unless tests are repeated often and outliers discarded). There are also artefacts of the cache design (cache associativity) that can reduce the effectiveness of the cache in ways that depend on the physical address/es allocated by the kernel; this might be seen as the "steps" in the graph shifting to different places.
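To estimate where that TLB "step" might appear: TLB coverage is roughly page size times the number of TLB entries. A trivial sketch; the 64-entry figure below is an assumed, illustrative number, not a measured one, so check your own CPU's documentation:

```c
/* Rough TLB-coverage estimate: working sets beyond this size start
   taking TLB misses on top of cache misses. E.g. 4KB pages with an
   assumed 64 data-TLB entries cover only 256KB. */
long tlb_coverage_bytes(long page_size, long tlb_entries) {
    return page_size * tlb_entries;
}
```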

