How to auto-vectorize an array comparison function

Question

This is a follow-up to this post. Disclaimer: I have done zero profiling and don't even have an application; this is purely for me to learn more about vectorization.

My code is as follows. I am compiling with gcc 4.9.4 on a machine with an i3 M370. The first loop vectorizes, as I expected. However, the second loop, which checks each element of temp, is not vectorized AFAICT; it is all andb instructions. I expected it to be vectorized as well.

Update: This post is similar to my first question. In that question, the vector was a raw pointer, so segfaults were possible, but here that isn't a concern. Therefore, AFAIK, reordering the comparison operations is safe here but not there. The conclusion is probably the same, though.

Recommended answer

Autovectorization really likes reduction operations, so the trick was to turn this into a reduction.

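#define ARR_LENGTH 4096
/* Tells GCC the data is 16-byte aligned so it can use aligned loads. */
typedef float afloat __attribute__ ((__aligned__(16)));
/* Returns 1 if the two arrays hold equal elements, 0 otherwise. */
int foo(afloat *a, afloat *b){
    unsigned int i = 0, j;
    unsigned int result;
    unsigned int blocksize = 4;
    /* Compare in blocks of doubling size. Each block is a reduction
       (a count of equal elements), so there is no early exit inside it. */
    while (blocksize <= ARR_LENGTH - i){
        result = 0;
        for (j=0; j<blocksize; j++){
            result += (*a) == (*b);
            a++;
            b++;
        }
        if (result != blocksize){
            return 0;   /* at least one mismatch in this block */
        }
        i += blocksize;
        blocksize *= 2;
    }
    /* Tail: the elements left over after the last whole block. */
    blocksize = ARR_LENGTH - i;
    result = 0;
    for (j=0; j<blocksize; j++){
        result += (*a) == (*b);
        a++;
        b++;
    }
    return result == blocksize;
}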

The inner loop compiles into a nice vectorized loop:

.L3:
        movaps  (%rdi,%rax), %xmm1
        addl    $1, %ecx
        cmpeqps (%rsi,%rax), %xmm1
        addq    $16, %rax
        cmpl    %r8d, %ecx
        psubd   %xmm1, %xmm0
        jb      .L3
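
In that loop, cmpeqps sets each lane that compares equal to all ones, which is -1 when read as a 32-bit integer, and psubd subtracts that mask from the accumulator, so every equal lane adds 1 to the running count. As a minimal sketch of the same reduction idea without the growing blocks, assuming an early exit is not needed at all, the count can be taken over the whole array in one pass. The name all_equal is hypothetical and not part of the original answer, and whether a given compiler vectorizes it depends on its version and flags.

```c
/* Hypothetical helper, not from the original answer: counts matching
   elements with no early exit, then compares the count with n. */
int all_equal(const float *a, const float *b, unsigned int n)
{
    unsigned int equal = 0;
    unsigned int i;
    for (i = 0; i < n; i++) {
        equal += (a[i] == b[i]);   /* each comparison contributes 0 or 1 */
    }
    return equal == n;
}
```

The trade-off is that this version always reads every element, while the block-doubling version above can stop after the first block that contains a mismatch.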
