为什么矢量化循环不具备的性能改进

本文介绍了为什么矢量化循环不具备的性能改进的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我调查矢量对程序的性能的影响。在这一点上，我写了下面的code：

I am investigating the effect of vectorization on the performance of the program. In this regard, I have written following code:

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define LEN 10000000

int main(){

    struct timeval stTime, endTime;

    double* a = (double*)malloc(LEN*sizeof(*a));
    double* b = (double*)malloc(LEN*sizeof(*b));
    double* c = (double*)malloc(LEN*sizeof(*c));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    gettimeofday(&stTime, NULL);

    for(k = 0; k < LEN; k++)
        c[k] = a[k] * b[k];

    gettimeofday(&endTime, NULL);

    FILE* fh = fopen("dump", "w");
    for(k = 0; k < LEN; k++)
        fprintf(fh, "c[%d] = %f\t", k, c[k]);
    fclose(fh);

    double timeE = (double)(endTime.tv_usec + endTime.tv_sec*1000000 - stTime.tv_usec - stTime.tv_sec*1000000);

    printf("Time elapsed: %f\n", timeE);

    return 0;
}

在此code，我只是初始化和乘以两个向量。结果保存在矢量 C 。我所主要感兴趣的是量化以下循环的效果：

In this code, I am simply initializing and multiplying two vectors. The results are saved in vector c. What I am mainly interested in is the effect of vectorizing following loop:

for(k = 0; k < LEN; k++)
    c[k] = a[k] * b[k];

我编译使用以下两个命令的code：

I compile the code using following two commands:

1) icc -O2 TestSMID.c -o TestSMID -no-vec -no-simd
2) icc -O2 TestSMID.c -o TestSMID -vec-report2

我希望看到的性能提升，因为第二个命令成功地向量化循环。然而，我的研究表明，当环路矢量没有提高性能。

I expect to see performance improvement since the second command successfully vectorizes the loop. However, my studies show that there is no performance improvement when the loop is vectorized.

我，因为我不是超级熟悉的话题可能已经在这里错过了一些东西。所以，请让我知道，如果有什么问题我的code。

I may have missed something here since I am not super familiar with the topic. So, please let me know if there is something wrong with my code.

在此先感谢您的帮助。

PS：我使用的Mac OSX，所以没有必要对齐数据，因为所有分配的内存是16字节对齐

PS: I am using Mac OSX, so there is no need to align the data as all the allocated memories are 16-byte aligned.

编辑：
我想首先感谢大家对您的意见和解答。
我想过通过@Mysticial提出的答案，还有一些应该在这里提到的一些更多的点。
首先，@Vinska提到的， C [K] =一[K] * B [K] 用不了一个周期。除了循环索引增量和作比较，以确保 K 比 LEN 小，还有其他的事情来完成以执行该操作。具有一看由编译器生成的汇编code，可以看出，一个简单的乘法需要远远超过一个周期。该量化版本是这样的：

I would like to first thank you all for your comments and answers.I thought about the answer proposed by @Mysticial and there are some further points that should be mentioned here.Firstly, as @Vinska mentioned, c[k]=a[k]*b[k] does not take only one cycle. In addition to loop index increment and the comparison made to ensure that k is smaller than LEN, there are other things to be done to perform the operation. Having a look at the assembly code generated by the compiler, it can be seen that a simple multiplication needs much more than one cycle. The vectorized version looks like:

L_B1.9:                         # Preds L_B1.8
        movq      %r13, %rax                                    #25.5
        andq      $15, %rax                                     #25.5
        testl     %eax, %eax                                    #25.5
        je        L_B1.12       # Prob 50%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.10:                        # Preds L_B1.9
        testb     $7, %al                                       #25.5
        jne       L_B1.32       # Prob 10%                      #25.5
                                # LOE rbx r12 r13 r14 r15
L_B1.11:                        # Preds L_B1.10
        movsd     (%r14), %xmm0                                 #26.16
        movl      $1, %eax                                      #25.5
        mulsd     (%r15), %xmm0                                 #26.23
        movsd     %xmm0, (%r13)                                 #26.9
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.12:                        # Preds L_B1.11 L_B1.9
        movl      %eax, %edx                                    #25.5
        movl      %eax, %eax                                    #26.23
        negl      %edx                                          #25.5
        andl      $1, %edx                                      #25.5
        negl      %edx                                          #25.5
        addl      $10000000, %edx                               #25.5
        lea       (%r15,%rax,8), %rcx                           #26.23
        testq     $15, %rcx                                     #25.5
        je        L_B1.16       # Prob 60%                      #25.5
                                # LOE rdx rbx r12 r13 r14 r15 eax
L_B1.13:                        # Preds L_B1.12
        movl      %eax, %eax                                    #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.14:                        # Preds L_B1.14 L_B1.13
        movups    (%r15,%rax,8), %xmm0                          #26.23
        movsd     (%r14,%rax,8), %xmm1                          #26.16
        movhpd    8(%r14,%rax,8), %xmm1                         #26.16
        mulpd     %xmm0, %xmm1                                  #26.23
        movntpd   %xmm1, (%r13,%rax,8)                          #26.9
        addq      $2, %rax                                      #25.5
        cmpq      %rdx, %rax                                    #25.5
        jb        L_B1.14       # Prob 99%                      #25.5
        jmp       L_B1.20       # Prob 100%                     #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.16:                        # Preds L_B1.12
        movl      %eax, %eax                                    #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.17:                        # Preds L_B1.17 L_B1.16
        movsd     (%r14,%rax,8), %xmm0                          #26.16
        movhpd    8(%r14,%rax,8), %xmm0                         #26.16
        mulpd     (%r15,%rax,8), %xmm0                          #26.23
        movntpd   %xmm0, (%r13,%rax,8)                          #26.9
        addq      $2, %rax                                      #25.5
        cmpq      %rdx, %rax                                    #25.5
        jb        L_B1.17       # Prob 99%                      #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.18:                        # Preds L_B1.17
        mfence                                                  #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.19:                        # Preds L_B1.18
        mfence                                                  #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.20:                        # Preds L_B1.14 L_B1.19 L_B1.32
        cmpq      $10000000, %rdx                               #25.5
        jae       L_B1.24       # Prob 0%                       #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.22:                        # Preds L_B1.20 L_B1.22
        movsd     (%r14,%rdx,8), %xmm0                          #26.16
        mulsd     (%r15,%rdx,8), %xmm0                          #26.23
        movsd     %xmm0, (%r13,%rdx,8)                          #26.9
        incq      %rdx                                          #25.5
        cmpq      $10000000, %rdx                               #25.5
        jb        L_B1.22       # Prob 99%                      #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.24:                        # Preds L_B1.22 L_B1.20

和非vectoized版本是：

And non-vectoized version is:

L_B1.9:                         # Preds L_B1.8
        xorl      %eax, %eax                                    #25.5
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.10:                        # Preds L_B1.10 L_B1.9
        lea       (%rax,%rax), %edx                             #26.9
        incl      %eax                                          #25.5
        cmpl      $5000000, %eax                                #25.5
        movsd     (%r15,%rdx,8), %xmm0                          #26.16
        movsd     8(%r15,%rdx,8), %xmm1                         #26.16
        mulsd     (%r13,%rdx,8), %xmm0                          #26.23
        mulsd     8(%r13,%rdx,8), %xmm1                         #26.23
        movsd     %xmm0, (%rbx,%rdx,8)                          #26.9
        movsd     %xmm1, 8(%rbx,%rdx,8)                         #26.9
        jb        L_B1.10       # Prob 99%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax

在这旁边，处理器不加载只有24个字节。在每次访问存储器，一个完整的线（64字节）是加载。更重要的是，由于所需的内存为 A ， B 和 C 是连续的，prefetcher肯定会帮助下一块了很多，负荷在前进。
话虽如此，我认为@Mysticial计算的内存带宽是太悲观了。

Beside this, the processor does not load only 24 bytes. In each access to memory, a full line (64 bytes) is loaded. More importantly, since the memory required for a, b, and c is contiguous, prefetcher would definitely help a lot and loads next blocks in advance.Having said that, I think the memory bandwidth calculated by @Mysticial is too pessimistic.

此外，使用SIMD提高程序的一个非常简单的加法的性能在Intel矢量化指南。因此，似乎我们应该能够获得这个很简单循环的一些性能改进。

Moreover, using SIMD to improve the performance of program for a very simple addition is mentioned in Intel Vectorization Guide. Therefore, it seems we should be able to gain some performance improvement for this very simple loop.

EDIT2：
再次感谢您的意见。另外，感谢为@Mysticial样品code，我终于看到了SIMD对性能改善的效果。的问题，因为Mysticial所提到的，是存储器带宽。随着选择小尺寸 A ， B 和 C 其装配到L1高速缓存，可以看出，SIMD有助于显著提高性能。下面是我得到的结果：

Thanks again for your comments. Also, thank to @Mysticial sample code, I finally saw the effect of SIMD on performance improvement. The problem, as Mysticial mentioned, was the memory bandwidth. With choosing small size for a, b, and c which fit into the L1 cache, it can be seen that SIMD can help to improve the performance significantly. Here are the results that I got:

icc -O2 -o TestSMIDNoVec -no-vec TestSMID2.c: 17.34 sec

icc -O2 -o TestSMIDVecNoUnroll -vec-report2 TestSMID2.c: 9.33 sec

和展开循环进一步提高性能：

And unrolling the loop improves the performance even further:

icc -O2 -o TestSMIDVecUnroll -vec-report2 TestSMID2.c -unroll=8: 8.6sec

另外，我要指出，只需要一个周期为我的处理器来完成迭代时 -O2 编译。

PS：我的电脑是MacBook Pro的酷睿i5 2.5GHz的@（双核）

PS: My computer is a Macbook Pro core i5 @2.5GHz (dual core)

推荐答案

由于你受到内存带宽的瓶颈。

Because you're bottlenecked by memory bandwidth.

虽然矢量等微观优化可以提高计算速度，又不能增加内存的速度。

While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.

在您的例子：

for(k = 0; k < LEN; k++)
    c[k] = a[k] * b[k];

您正在单传过来的所有内存做很少的工作。这是杏你的内存带宽。

You are making a single pass over all the memory doing very little work. This is maxing out your memory bandwidth.

所以不管它是如何优化（矢量，展开，等等）是不是会得到更快。

So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.

2013年一个典型的台式机具有的 10 GB的内存带宽/ S 的顺序*上。结果你的循环触摸 24字节/迭代

A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*.
Your loop touches 24 bytes/iteration.

如果没有矢量化，现代化的x64处理器的能可能会做约1迭代周期*。

Without vectorization, a modern x64 processor can probably do about 1 iteration a cycle*.

假设你正在运行在4GHz：

Suppose you're running at 4 GHz:

（4 * 10 ^ 9）* 24字节/迭代= 96 GB / s的

(4 * 10^9) * 24 bytes/iteration = 96 GB/s

这几乎是10倍的内存带宽 - 无矢量

That's almost 10x of your memory bandwidth - without vectorization.

*毫不奇怪，几个人怀疑我上面给了，因为我没有给出引用的数字。那么那些都是我的头从经验的顶部。因此，这里的一些基准来证明这一点。

*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. Well those were off the top of my head from experience. So here's some benchmarks to prove it.

的循环迭代可以为1个周期/循环的速度运行：

我们可以摆脱内存瓶颈，如果我们减少 LEN ，使其在高速缓存中适合。结果
（我用C测试这++，因为它是比较容易，但它没有什么区别。）

We can get rid of the memory bottleneck if we reduce LEN so that it fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)

#include <iostream>
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 256;

    double *a = (double*)malloc(LEN*sizeof(*a));
    double *b = (double*)malloc(LEN*sizeof(*a));
    double *c = (double*)malloc(LEN*sizeof(*a));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    clock_t time0 = clock();

    for (int i = 0; i < 100000000; i++){
        for(k = 0; k < LEN; k++)
            c[k] = a[k] * b[k];
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}

处理器：Intel酷睿i7 2600K @ 4.2GHz的

编译器：可视Studio 2012中

时间6.55秒。

在这个测试中，我跑了256亿次迭代只在 6.55 秒。

In this test, I ran 25,600,000,000 iterations in only 6.55 seconds.

6.55 * 4.2 GHz的 = 275.1亿次

275.1亿/ 256亿 = 1.074次/迭代

6.55 * 4.2 GHz = 27,510,000,000 cycles
27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration

现在如果你想知道它是如何可以做到：

Now if you're wondering how it's possible to do:

2负荷

1号店

1乘

增加计数器

比较+分支

在一个周期内...

这是因为现代的处理器和编译器是真棒。

It's because modern processors and compilers are awesome.

虽然每个这些操作有延迟（尤其是乘法），处理器能够同时执行多个迭代。我的测试机器是Sandy Bridge处理器，它是能够维持2x128b负荷，1x128b存储和1x256b矢量FP乘的每一个周期。和潜在的另外一个或两个载体或整数OPS，如果载荷为微稠微指令存储器源操作数。（只有2负载+ 1专卖店吞吐量使用256B AVX加载/存储，否则只有两个每个周期的总内存OP时（最多一家商店））。

While each of these operations have latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine is a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (2 loads + 1 store throughput only when using 256b AVX loads/stores, otherwise only two total memory ops per cycle (at most one store)).

综观组件（我将省略为简洁起见），似乎编译器循环展开了，从而降低了循环的开销。但它并没有完全管理向量化它。

Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.

内存带宽10 GB / s量级上：

要测试这个最简单的方法是通过 memset的（）：

The easiest way to test this is via a memset():

#include <iostream>
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 1 << 30;    //  1GB

    char *a = (char*)calloc(LEN,1);

    clock_t time0 = clock();

    for (int i = 0; i < 100; i++){
        memset(a,0xff,LEN);
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}

处理器：Intel酷睿i7 2600K @ 4.2GHz的

编译器：的Visual Studio 2012

时间：5.811秒

因此，需要我的机器上的 5.811 秒写100 GB的内存。这是关于 17.2 GB / s的

So it takes my machine 5.811 seconds to write to 100 GB of memory. That's about 17.2 GB/s.

和我的处理器是高端。在Nehalem处理器和酷睿2代处理器有较少的内存带宽。

And my processor is on the higher end. The Nehalem and Core 2 generation processors have less memory bandwidth.

MACHINE