Question
How can I calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2?
The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats.
The output is an array of floats.
Both arrays have correct memory alignment for SSE.
The code below only uses 1 xmm register, runs on Linux, and can be compiled with gcc -O3 code.cpp -lrt -msse4.2
Thanks.
#include <iostream>
#include <cstdlib>     // aligned_alloc, free
#include <emmintrin.h> // SSE intrinsics (_mm_load_ps, _mm_rsqrt_ps, _mm_store_ps)
#include <time.h>
using namespace std;

void print_xmm(__m128 xmm){
    float out[4];
    _mm_storeu_ps(out, xmm);
    int i;
    for (i = 0; i < 4; ++i) std::cout << out[i] << " ";
    std::cout << std::endl;
}

void print_arr(float* ptr, size_t size){
    size_t i;
    for (i = 0; i < size; ++i){
        cout << ptr[i] << " ";
    }
    cout << endl;
}

int main(void){
    size_t size = 25000 * 4;   // this has to be a multiple of 4
    size_t repeat = 10000;     // test 10000 cycles of the code
    float* ar_in  = (float*)aligned_alloc(16, size*sizeof(float));
    float* ar_out = (float*)aligned_alloc(16, size*sizeof(float));

    // fill test data into the input array;
    // the data is an array of positive numbers.
    size_t i;
    for (i = 0; i < size; ++i){
        ar_in[i] = (i+1) * (i+1);
    }

    // prepare for reciprocal square root
    __m128 xmm0;
    size_t size_fix = size*sizeof(float)/sizeof(__m128);   // size in bytes / 16 = number of __m128 vectors
    float* ar_in_end = ar_in + size_fix;
    float* ar_out_now;
    float* ar_in_now;

    // timing
    struct timespec tp_start, tp_end;
    i = repeat;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    // start timing
    while(--i){
        ar_out_now = ar_out;
        for (ar_in_now = ar_in;
             ar_in_now != ar_in_end;
             ar_in_now += 4, ar_out_now += 4){
            // 4 = sizeof(__m128)/sizeof(float)
            xmm0 = _mm_load_ps(ar_in_now);
            //cout << "load xmm: ";
            //print_xmm(xmm0);
            xmm0 = _mm_rsqrt_ps(xmm0);
            //cout << "rsqrt xmm: ";
            //print_xmm(xmm0);
            _mm_store_ps(ar_out_now, xmm0);
        }
    }
    // end timing
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;
    timing = ((double)(tp_end.tv_sec - tp_start.tv_sec)
              + (tp_end.tv_nsec - tp_start.tv_nsec) * nano) / repeat;
    cout << " timing per cycle: " << timing << endl;
    /*
    cout << "input array: ";
    print_arr(ar_in, size);
    cout << "output array: ";
    print_arr(ar_out, size);
    */
    // free mem
    free(ar_in);
    free(ar_out);
    return 0;
}
Answer
How big is your array of floats? If it's already hot in L1 (or maybe L2), gcc5.3 output for this code bottlenecks on uop throughput on modern Intel CPUs, since it makes a loop with 6 fused-domain uops that does one vector per iteration. (So it will run at one vector per 2 cycles).
To achieve 1 vector per clock throughput on modern Intel CPUs, you need the loop to be unrolled (see below for why non-unrolled asm can't work). Having the compiler do it for you is probably good (instead of doing it by hand in the C++ source), e.g. with profile-guided optimization (gcc -fprofile-use) or just blindly with -funroll-loops.
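For illustration only (not code from the answer), a minimal sketch of what a hand-unrolled loop could look like if you did want to do it in the source. The function name is made up; it assumes 16-byte-aligned pointers and n a multiple of 16, and it uses pointer increments plus small constant offsets so the addressing modes stay one-register (base + displacement), which is the form discussed below:

#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// Hand-unrolled by 4 vectors (16 floats) per iteration.
// A stand-in for what -funroll-loops would generate from the original loop.
void rsqrt_array_unrolled(const float* in, float* out, size_t n)
{
    const float* end = in + n;         // n must be a multiple of 16
    while (in != end) {
        _mm_store_ps(out,      _mm_rsqrt_ps(_mm_load_ps(in)));
        _mm_store_ps(out +  4, _mm_rsqrt_ps(_mm_load_ps(in +  4)));
        _mm_store_ps(out +  8, _mm_rsqrt_ps(_mm_load_ps(in +  8)));
        _mm_store_ps(out + 12, _mm_rsqrt_ps(_mm_load_ps(in + 12)));
        in  += 16;                     // only two pointer increments per 4 vectors
        out += 16;
    }
}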
16 bytes per clock is enough in theory to saturate main memory bandwidth with one core. However, IIRC Z Boson has observed better bandwidth from using multiple cores, probably because multiple cores keep more requests outstanding, and a stall on one core doesn't leave memory idle. If the input is hot in L2 on a core, though, it's probably best to process the data using that core.
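If you do try multiple cores, a simple way to split a big array across threads is an OpenMP parallel loop over vector-sized chunks. This is a sketch under my own assumptions (function name made up, compile with -fopenmp), not something from the original answer:

#include <cstddef>       // ptrdiff_t
#include <xmmintrin.h>   // SSE intrinsics

// Process n floats (n a multiple of 4, both pointers 16-byte aligned),
// splitting contiguous chunks of iterations across the available cores.
void rsqrt_array_mt(const float* in, float* out, ptrdiff_t n)
{
    #pragma omp parallel for
    for (ptrdiff_t i = 0; i < n; i += 4) {
        _mm_store_ps(out + i, _mm_rsqrt_ps(_mm_load_ps(in + i)));
    }
}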
On Haswell or later, 16 bytes loaded and stored per clock is only half of L1 cache bandwidth, so you need an AVX version of this to max per-core bandwidth.
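An AVX version of the loop might look like the sketch below (my assumptions, not the answer's code: arrays 32-byte aligned, n a multiple of 8, compiled with -mavx):

#include <cstddef>
#include <immintrin.h>   // AVX intrinsics

// 8 floats (32 bytes) loaded and stored per iteration.
void rsqrt_array_avx(const float* in, float* out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(in + i);          // 32-byte aligned load
        _mm256_store_ps(out + i, _mm256_rsqrt_ps(v));
    }
}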
If you bottleneck on memory, you might want to do a Newton-Raphson iteration to get a nearly-full accuracy 1/sqrt(x), especially if you use multiple threads for a big array. (Because then it's ok if a single thread can't sustain one load+store per clock.)
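For reference, the usual one-step Newton-Raphson refinement of rsqrtps is the y1 = y0 * (1.5 - 0.5 * x * y0 * y0) form. The sketch below is that standard idiom, not code quoted from the answer:

#include <xmmintrin.h>   // SSE intrinsics

// Refine the ~12-bit rsqrtps estimate to nearly full single precision.
static inline __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y  = _mm_rsqrt_ps(x);                 // initial estimate
    __m128 yy = _mm_mul_ps(y, y);
    // y * 0.5 * (3 - x*y*y)  ==  y * (1.5 - 0.5*x*y*y)
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, _mm_mul_ps(x, yy)));
}

One caveat: for an input element of 0, rsqrtps returns +inf, but the refinement then produces NaN (from 0 * inf), so handle zeros separately if they can occur in your data.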
Or maybe just use rsqrt on the fly when loading this data later. It's very cheap, with high throughput, but its latency is still similar to an FP add. Again, if it's a big array that doesn't fit in cache, increasing computational intensity by doing fewer separate passes over the data is a big deal. (Cache blocking, aka loop tiling, would also be a good idea if you can do so: run multiple steps of your algorithm on a cache-sized chunk of your data.)
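To illustrate "fewer separate passes", here is a sketch with a made-up consumer (not from the answer): instead of writing out a whole array of reciprocal square roots and reading it back later, compute rsqrt at the point where the later pass already has the data in registers.

#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// Hypothetical later pass that needs 1/sqrt(in[i]) as a scale factor:
// fuse the rsqrt into it instead of doing a separate convert-and-store pass.
void scale_by_rsqrt(const float* in, const float* data, float* result, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 scale = _mm_rsqrt_ps(_mm_load_ps(in + i));   // computed on the fly
        __m128 v     = _mm_load_ps(data + i);
        _mm_store_ps(result + i, _mm_mul_ps(v, scale));
    }
}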
Only use cache-bypassing NT stores as a last resort if you can't find a way to make effective use of cache. It's much better if you can convert some data that you're about to use, so it's in cache when it's next used.
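If you do end up in that last-resort case, an NT-store version would look something like this sketch (function name made up): the only change from the original inner loop is movntps via _mm_stream_ps, plus a store fence before another thread reads the output.

#include <cstddef>
#include <xmmintrin.h>   // _mm_load_ps, _mm_rsqrt_ps, _mm_stream_ps, _mm_sfence

// Same loop as before, but the stores bypass the cache.
// in/out must be 16-byte aligned, n a multiple of 4.
void rsqrt_array_nt(const float* in, float* out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(out + i, _mm_rsqrt_ps(_mm_load_ps(in + i)));   // movntps
    }
    _mm_sfence();   // make the NT stores globally visible before out[] is read elsewhere
}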
The main loop (from .L31 to jne .L31, on the Godbolt compiler explorer) is 6 uops for Intel SnB-family CPUs, because indexed addressing modes don't micro-fuse. (This isn't yet documented in Agner Fog's microarch pdf, unfortunately.)
It's 4 fused-domain uops on Nehalem, with only three ALU uops, so Nehalem should run this at 1 per clock.
.L31:                                        # the main loop: 6 uops on SnB-family, 4 uops on Nehalem
    rsqrtps xmm0, XMMWORD PTR [rbx+rax]      # tmp127, MEM[base: ar_in_now_10, index: ivtmp.51_61, offset: 0B]
    movaps  XMMWORD PTR [rbp+0+rax], xmm0    # MEM[base: ar_out_now_12, index: ivtmp.51_61, offset: 0B], tmp127
    add     rax, 16                          # ivtmp.51,
    cmp     rax, 100000                      # ivtmp.51,
    jne     .L31                             #,
Since you want to write a separate destination, there's no way to get the loop down to 4 fused-domain uops so it can run at one vector per clock without unrolling. (Both the load and the store need to be one-register addressing modes, so the trick of using src - dst indexed by current_dst instead of incrementing src doesn't work.)
Modifying your C++ so gcc would use a pointer increment would only save one uop, because you have to increment src and dst. I.e. float *endp = start + length; and for (p = start ; p < endp ; p += 4) {} would make a loop like:
.loop:
    rsqrtps xmm0, [rsi]
    add     rsi, 16
    movaps  [rdi], xmm0
    add     rdi, 16
    cmp     rdi, rbx
    jne     .loop
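In source form, that pointer-increment loop corresponds roughly to the sketch below (my own function and variable names, shown only to spell out the pattern; both load and store use a single-register addressing mode):

#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// n must be a multiple of 4, both pointers 16-byte aligned.
void rsqrt_array_ptr(const float* src, float* dst, size_t n)
{
    const float* endp = src + n;
    for (const float* p = src; p != endp; p += 4, dst += 4) {
        _mm_store_ps(dst, _mm_rsqrt_ps(_mm_load_ps(p)));   // [reg] addressing on both sides
    }
}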
Hopefully gcc will do something like this while unrolling; otherwise the rsqrtps + movaps will be 4 fused-domain uops on their own if they still use an indexed addressing mode, and no amount of unrolling will make your loop run at one vector per clock.