Question
I have the following code to calculate a Hadamard transform. Right now, the hadamard function is the bottleneck of my program. Do you see any potential to speed it up? Maybe using AVX2 instructions? Typical input sizes are around 512 or 1024.
Best, Tom
#include <stdio.h>

void hadamard(double *p, size_t len) {
    double tmp = 0.0;
    if (len == 2) {
        tmp = p[0];
        p[0] = tmp + p[1];
        p[1] = tmp - p[1];
    } else {
        hadamard(p, len/2);
        hadamard(p + len/2, len/2);
        for (int i = 0; i < len/2; i++) {
            tmp = p[i];
            p[i] = tmp + p[i + len/2];
            p[i + len/2] = tmp - p[i + len/2];
        }
    }
}

int main(int argc, char* argv[]) {
    double a[] = {1.0, 2.0, 3.0, 4.0};
    hadamard(a, 4);
}
Answer
Here is a proof-of-concept implementation, based on the Fast Walsh–Hadamard transform, with an optimized first pass. It compiles fine with clang-3.3 or later and gives nice results with clang-4.0 or later (-O2 is required to properly inline the relevant functions). If you don't have FMA, you can instead xor the upper two elements of hada2 with -0.0 in hadamard4 (and use a normal _mm256_add_pd).
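For illustration, here is a minimal sketch of that AVX-only variant, mirroring the hadamard4 in the listing below; the function name hadamard4_avx and the explicit sign-mask constant are my additions:

#include <immintrin.h>

inline __m256d hadamard4_avx(__m256d x0123)
{
    __m256d x1032  = _mm256_permute_pd(x0123, 5);              // [x1, x0, x3, x2]
    __m256d hada2  = _mm256_addsub_pd(x0123, x1032);           // [x0-x1, x0+x1, x2-x3, x2+x3]
    hada2          = _mm256_permute_pd(hada2, 5);              // [x0+x1, x0-x1, x2+x3, x2-x3]
    __m256d hada2_ = _mm256_permute2f128_pd(hada2, hada2, 1);  // [x2+x3, x2-x3, x0+x1, x0-x1]
    // negate the upper two elements of hada2 by xoring their sign bits, then add
    __m256d sign   = _mm256_set_pd(-0.0, -0.0, 0.0, 0.0);      // flips elements 2 and 3
    return _mm256_add_pd(_mm256_xor_pd(hada2, sign), hada2_);
}

Since the multiplier is only ±1.0, the FMA version in the listing below produces exactly the same values; it just folds the sign flip and the add into one instruction.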
All gcc versions I checked would require replacing the memcpy by manual load/store intrinsics to get similar results.
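For illustration, a sketch of what that replacement might look like for the first pass; hadamard16 is the function from the listing below, and the helper name hadamard_first_pass is mine:

#include <immintrin.h>
#include <stddef.h>

void hadamard16(__m256d data[4]); // from the listing below

inline void hadamard_first_pass(double* p, size_t len)
{
    for (size_t i = 0; i < len; i += 16)
    {
        __m256d data[4];
        for (int k = 0; k < 4; ++k)
            data[k] = _mm256_loadu_pd(p + i + 4*k);   // explicit vmovupd loads
        hadamard16(data);
        for (int k = 0; k < 4; ++k)
            _mm256_storeu_pd(p + i + 4*k, data[k]);   // explicit vmovupd stores
    }
}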
Also, I left the handling of the len < 16 cases as an exercise. It may also be worth implementing a hadamard32 and perhaps a hadamard64 similar to hadamard16, to make better use of the available registers and reduce memory accesses; a sketch of such a hadamard32 follows the listing below. In C++ this could be done with a recursive template implementation.
#include <immintrin.h> // avx+fma
#include <assert.h>    // assert
#include <string.h>    // memcpy

inline __m256d hadamard4(__m256d x0123)
{
    __m256d x1032  = _mm256_permute_pd(x0123, 5);              // [x1, x0, x3, x2]
    __m256d hada2  = _mm256_addsub_pd(x0123, x1032);           // [x0-x1, x0+x1, x2-x3, x2+x3]
    hada2          = _mm256_permute_pd(hada2, 5);              // [x0+x1, x0-x1, x2+x3, x2-x3]
    __m256d hada2_ = _mm256_permute2f128_pd(hada2, hada2, 1);  // [x2+x3, x2-x3, x0+x1, x0-x1]
    // if no FMA is available, this can be done by xoring the upper two elements
    // of hada2 with -0.0 and using a normal _mm256_add_pd:
    __m256d res = _mm256_fmadd_pd(hada2, _mm256_set_pd(-1.0, -1.0, 1.0, 1.0), hada2_);
    return res;  // [x0+x1+x2+x3, x0-x1+x2-x3, x0+x1-x2-x3, x0-x1-x2+x3]
}
inline void hadamard8(__m256d data[2])
{
    __m256d a = hadamard4(data[0]);
    __m256d b = hadamard4(data[1]);
    data[0] = _mm256_add_pd(a, b);
    data[1] = _mm256_sub_pd(a, b);
}

inline void hadamard16(__m256d data[4])
{
    hadamard8(data + 0);
    hadamard8(data + 2);
    for (int i = 0; i < 2; ++i)
    {
        __m256d tmp = data[i];
        data[i]     = _mm256_add_pd(tmp, data[i+2]);
        data[i+2]   = _mm256_sub_pd(tmp, data[i+2]);
    }
}
void hadamard(double* p, size_t len)
{
    assert((len & (len-1)) == 0); // len must be power of 2
    assert(len >= 16);            // TODO implement fallback for smaller sizes ...

    // first pass: hadamard of 16 values each
    for (size_t i = 0; i < len; i += 16)
    {
        __m256d data[4];
        memcpy(data, p+i, sizeof(data)); // should get optimized to 4x vmovupd
        hadamard16(data);
        memcpy(p+i, data, sizeof(data)); // should get optimized to 4x vmovupd
    }

    for (size_t h = 32; h < len; h *= 2)
    {
        for (size_t i = 0; i < len; i += 2*h)
        {
            for (size_t j = i; j < i+h; j += 4)
            {
                __m256d x = _mm256_loadu_pd(p+j);
                __m256d y = _mm256_loadu_pd(p+j+h);
                _mm256_storeu_pd(p+j,   _mm256_add_pd(x, y));
                _mm256_storeu_pd(p+j+h, _mm256_sub_pd(x, y));
            }
        }
    }
}
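As suggested above, a hadamard32 would follow exactly the same pattern as hadamard16; a minimal sketch (the name and the choice to stop at 32 are assumptions on my part):

inline void hadamard32(__m256d data[8])
{
    hadamard16(data + 0);
    hadamard16(data + 4);
    for (int i = 0; i < 4; ++i)
    {
        __m256d tmp = data[i];
        data[i]     = _mm256_add_pd(tmp, data[i+4]);
        data[i+4]   = _mm256_sub_pd(tmp, data[i+4]);
    }
}

Using it, the first pass in hadamard() would step by 32 values at a time and the outer loop would start at h = 64.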
Benchmarking is also left as an exercise ;-)
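For what it's worth, here is a minimal sketch of such a harness; it assumes POSIX clock_gettime and that Tom's original recursive version is kept around as hadamard_scalar (my naming) for a correctness check:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>

void hadamard(double* p, size_t len);         // the AVX version above
void hadamard_scalar(double* p, size_t len);  // Tom's recursive version, renamed

int main(void)
{
    enum { LEN = 1024, REPS = 100000 };
    static double src[LEN], a[LEN], b[LEN];
    for (int i = 0; i < LEN; ++i)
        src[i] = sin(i);                      // arbitrary test data

    // correctness check against the scalar reference
    memcpy(a, src, sizeof a);
    memcpy(b, src, sizeof b);
    hadamard(a, LEN);
    hadamard_scalar(b, LEN);
    double maxerr = 0.0;
    for (int i = 0; i < LEN; ++i)
        maxerr = fmax(maxerr, fabs(a[i] - b[i]));
    printf("max abs difference vs. scalar: %g\n", maxerr);

    // timing loop; the memcpy keeps the input bounded and is included in the time
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; ++r) {
        memcpy(a, src, sizeof a);
        hadamard(a, LEN);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per transform of %d doubles (incl. copy)\n",
           1e9 * secs / REPS, LEN);
    return 0;
}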
Disclaimer: I did not test this, so double-check the operand order of hada2_ and hada2 in the _mm256_fmadd_pd step against the scalar version before relying on it ...