Question
I have the following code to calculate a Hadamard transform. Right now, the hadamard function is the bottleneck of my program. Do you see any potential to speed it up? Maybe using AVX2 instructions? Typical input sizes are around 512 or 1024.
Best, Tom
#include <stdio.h>

void hadamard(double *p, size_t len) {
    double tmp = 0.0;
    if (len == 2) {
        tmp = p[0];
        p[0] = tmp + p[1];
        p[1] = tmp - p[1];
    } else {
        hadamard(p, len/2);
        hadamard(p + len/2, len/2);
        for (int i = 0; i < len/2; i++) {
            tmp = p[i];
            p[i] = tmp + p[i + len/2];
            p[i + len/2] = tmp - p[i + len/2];
        }
    }
}

int main(int argc, char* argv[]) {
    double a[] = {1.0, 2.0, 3.0, 4.0};
    hadamard(a, 4);
}
Answer
Here is a proof-of-concept implementation, based on the Fast Walsh–Hadamard transform, with an optimized first pass. It compiles fine with clang-3.3 or later and gives nice results with clang-4.0 or later (-O2 is required to properly inline the relevant functions). If you don't have FMA, you can instead xor the upper two elements of hada2 with -0.0 in hadamard4 (and use a normal _mm256_add_pd).
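For illustration, here is a minimal sketch of that AVX-only variant, mirroring the hadamard4 in the listing below; the function name hadamard4_avx and the explicit sign-mask constant are my additions:

#include <immintrin.h>

inline __m256d hadamard4_avx(__m256d x0123)
{
    __m256d x1032  = _mm256_permute_pd(x0123, 5);              // [x1, x0, x3, x2]
    __m256d hada2  = _mm256_addsub_pd(x0123, x1032);           // [x0-x1, x0+x1, x2-x3, x2+x3]
    hada2          = _mm256_permute_pd(hada2, 5);              // [x0+x1, x0-x1, x2+x3, x2-x3]
    __m256d hada2_ = _mm256_permute2f128_pd(hada2, hada2, 1);  // [x2+x3, x2-x3, x0+x1, x0-x1]
    // negate the upper two elements of hada2 by xoring their sign bits, then add
    __m256d sign   = _mm256_set_pd(-0.0, -0.0, 0.0, 0.0);      // flips elements 2 and 3
    return _mm256_add_pd(_mm256_xor_pd(hada2, sign), hada2_);
}

Since the multiplier is only ±1.0, the FMA version in the listing below produces exactly the same values; it just folds the sign flip and the add into one instruction.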
All gcc versions I checked would require replacing the memcpy by manual load/store intrinsics to get similar results.
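For illustration, a sketch of what that replacement might look like for the first pass; hadamard16 is the function from the listing below, and the helper name hadamard_first_pass is mine:

#include <immintrin.h>
#include <stddef.h>

void hadamard16(__m256d data[4]); // from the listing below

inline void hadamard_first_pass(double* p, size_t len)
{
    for (size_t i = 0; i < len; i += 16)
    {
        __m256d data[4];
        for (int k = 0; k < 4; ++k)
            data[k] = _mm256_loadu_pd(p + i + 4*k);   // explicit vmovupd loads
        hadamard16(data);
        for (int k = 0; k < 4; ++k)
            _mm256_storeu_pd(p + i + 4*k, data[k]);   // explicit vmovupd stores
    }
}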
Also, I left the handling of the len < 16 cases as an exercise. It may also be worth implementing a hadamard32 and perhaps a hadamard64 similar to hadamard16, to make better use of the available registers and reduce memory accesses; a sketch of such a hadamard32 follows the listing below. In C++ this could be done with a recursive template implementation.
#include <immintrin.h> // avx+fma
#include <assert.h>    // assert
#include <string.h>    // memcpy

inline __m256d hadamard4(__m256d x0123)
{
    __m256d x1032  = _mm256_permute_pd(x0123, 5);              // [x1, x0, x3, x2]
    __m256d hada2  = _mm256_addsub_pd(x0123, x1032);           // [x0-x1, x0+x1, x2-x3, x2+x3]
    hada2          = _mm256_permute_pd(hada2, 5);              // [x0+x1, x0-x1, x2+x3, x2-x3]
    __m256d hada2_ = _mm256_permute2f128_pd(hada2, hada2, 1);  // [x2+x3, x2-x3, x0+x1, x0-x1]
    // if no FMA is available, this can be done by xoring the upper two elements
    // of hada2 with -0.0 and using a normal _mm256_add_pd:
    __m256d res = _mm256_fmadd_pd(hada2, _mm256_set_pd(-1.0, -1.0, 1.0, 1.0), hada2_);
    return res;  // [x0+x1+x2+x3, x0-x1+x2-x3, x0+x1-x2-x3, x0-x1-x2+x3]
}
inline void hadamard8(__m256d data[2])
{
    __m256d a = hadamard4(data[0]);
    __m256d b = hadamard4(data[1]);
    data[0] = _mm256_add_pd(a, b);
    data[1] = _mm256_sub_pd(a, b);
}

inline void hadamard16(__m256d data[4])
{
    hadamard8(data + 0);
    hadamard8(data + 2);
    for (int i = 0; i < 2; ++i)
    {
        __m256d tmp = data[i];
        data[i]     = _mm256_add_pd(tmp, data[i+2]);
        data[i+2]   = _mm256_sub_pd(tmp, data[i+2]);
    }
}
void hadamard(double* p, size_t len)
{
    assert((len & (len-1)) == 0); // len must be power of 2
    assert(len >= 16);            // TODO implement fallback for smaller sizes ...

    // first pass: hadamard of 16 values each
    for (size_t i = 0; i < len; i += 16)
    {
        __m256d data[4];
        memcpy(data, p+i, sizeof(data)); // should get optimized to 4x vmovupd
        hadamard16(data);
        memcpy(p+i, data, sizeof(data)); // should get optimized to 4x vmovupd
    }

    for (size_t h = 32; h < len; h *= 2)
    {
        for (size_t i = 0; i < len; i += 2*h)
        {
            for (size_t j = i; j < i+h; j += 4)
            {
                __m256d x = _mm256_loadu_pd(p+j);
                __m256d y = _mm256_loadu_pd(p+j+h);
                _mm256_storeu_pd(p+j,   _mm256_add_pd(x, y));
                _mm256_storeu_pd(p+j+h, _mm256_sub_pd(x, y));
            }
        }
    }
}
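As suggested above, a hadamard32 would follow exactly the same pattern as hadamard16; a minimal sketch (the name and the choice to stop at 32 are assumptions on my part):

inline void hadamard32(__m256d data[8])
{
    hadamard16(data + 0);
    hadamard16(data + 4);
    for (int i = 0; i < 4; ++i)
    {
        __m256d tmp = data[i];
        data[i]     = _mm256_add_pd(tmp, data[i+4]);
        data[i+4]   = _mm256_sub_pd(tmp, data[i+4]);
    }
}

Using it, the first pass in hadamard() would step by 32 values at a time and the outer loop would start at h = 64.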
Benchmarking is also left as an exercise ;-)
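For what it's worth, here is a minimal sketch of such a harness; it assumes POSIX clock_gettime and that Tom's original recursive version is kept around as hadamard_scalar (my naming) for a correctness check:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>

void hadamard(double* p, size_t len);         // the AVX version above
void hadamard_scalar(double* p, size_t len);  // Tom's recursive version, renamed

int main(void)
{
    enum { LEN = 1024, REPS = 100000 };
    static double src[LEN], a[LEN], b[LEN];
    for (int i = 0; i < LEN; ++i)
        src[i] = sin(i);                      // arbitrary test data

    // correctness check against the scalar reference
    memcpy(a, src, sizeof a);
    memcpy(b, src, sizeof b);
    hadamard(a, LEN);
    hadamard_scalar(b, LEN);
    double maxerr = 0.0;
    for (int i = 0; i < LEN; ++i)
        maxerr = fmax(maxerr, fabs(a[i] - b[i]));
    printf("max abs difference vs. scalar: %g\n", maxerr);

    // timing loop; the memcpy keeps the input bounded and is included in the time
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; ++r) {
        memcpy(a, src, sizeof a);
        hadamard(a, LEN);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per transform of %d doubles (incl. copy)\n",
           1e9 * secs / REPS, LEN);
    return 0;
}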
Disclaimer: I did not test this, so double-check the operand order of hada2_ and hada2 in the _mm256_fmadd_pd step against the scalar version before relying on it ...