Problem description
I'm trying to enhance the performance of my code by using 256-bit vectors (Intel intrinsics - AVX).
I have an i7 Gen. 4 (Haswell architecture) processor supporting SSE1 to SSE4.2 and the AVX/AVX2 extensions.
This is the code snippet that I'm trying to enhance:
/* code snippet */
kfac1 = kfac + factor; /* 7 cycles for 7 additions */
kfac2 = kfac1 + factor;
kfac3 = kfac2 + factor;
kfac4 = kfac3 + factor;
kfac5 = kfac4 + factor;
kfac6 = kfac5 + factor;
kfac7 = kfac6 + factor;
k1fac1 = k1fac + factor1; /* 7 cycles for 7 additions */
k1fac2 = k1fac1 + factor1;
k1fac3 = k1fac2 + factor1;
k1fac4 = k1fac3 + factor1;
k1fac5 = k1fac4 + factor1;
k1fac6 = k1fac5 + factor1;
k1fac7 = k1fac6 + factor1;
k2fac1 = k2fac + factor2; /* 7 cycles for 7 additions */
k2fac2 = k2fac1 + factor2;
k2fac3 = k2fac2 + factor2;
k2fac4 = k2fac3 + factor2;
k2fac5 = k2fac4 + factor2;
k2fac6 = k2fac5 + factor2;
k2fac7 = k2fac6 + factor2;
/* code snippet */
From the Intel manuals, I found this:
- an integer addition ADD takes 1 cycle (latency).
- an addition of a vector of 8 integers (32-bit) also takes 1 cycle.
So I've tried to make it this way:
__m256i fac  = _mm256_set1_epi32(factor);
__m256i fac1 = _mm256_set1_epi32(factor1);
__m256i fac2 = _mm256_set1_epi32(factor2);

__m256i v1 = _mm256_set_epi32(0, kfac6, kfac5, kfac4, kfac3, kfac2, kfac1, kfac);
__m256i v2 = _mm256_set_epi32(0, k1fac6, k1fac5, k1fac4, k1fac3, k1fac2, k1fac1, k1fac);
__m256i v3 = _mm256_set_epi32(0, k2fac6, k2fac5, k2fac4, k2fac3, k2fac2, k2fac1, k2fac);

__m256i res1 = _mm256_add_epi32(v1, fac);  ////////////////////
__m256i res2 = _mm256_add_epi32(v2, fac1); // just 3 cycles //
__m256i res3 = _mm256_add_epi32(v3, fac2); ////////////////////
But the problem is that these factors are going to be used as table indexes (table[kfac] ...). So I have to extract the factors as separate integers again. I wonder if there is any possible way to do that?
Recommended answer
A smart compiler could get table+factor into a register and use indexed addressing modes to get table+factor+k1fac6 as an address. Check the asm, and if the compiler doesn't do this for you, try changing the source to hand-hold the compiler:
const int *tf = table + factor;
const int *tf2 = table + factor2; // could be lea rdx, [rax+rcx*4] or something.
...
foo = tf[kfac2];
bar = tf2[k2fac6]; // could be mov r12, [rdx + rdi*4]
But to answer the question you asked:
Latency isn't a big deal when you have that many independent adds happening. The throughput of 4 scalar add instructions per clock on Haswell is much more relevant.
If k1fac2 and so on are already in contiguous memory, then using SIMD is possibly worth it. Otherwise all the shuffling and data transfer to get them in/out of vector regs makes it definitely not worth it (i.e. the stuff the compiler emits to implement _mm256_set_epi32(0, kfac6, kfac5, kfac4, kfac3, kfac2, kfac1, kfac)).
You could avoid needing to get the indices back into integer registers by using an AVX2 gather for the table loads. But gather is slow on Haswell, so probably not worth it. Maybe worth it on Broadwell.
On Skylake, gather is fast so it could be good if you can SIMD whatever you do with the LUT results. If you need to extract all the gather results back to separate integer registers, it's probably not worth it.
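For illustration, a gather-based table lookup might look roughly like this (a minimal sketch, assuming table is an int array and the 8 indices are already in a __m256i such as res1 from the question):

#include <immintrin.h>

/* Sketch only: loads table[idx[i]] for i = 0..7 with one vpgatherdd.
   Scale 4 = sizeof(int). */
static inline __m256i gather_lut(const int *table, __m256i idx)
{
    return _mm256_i32gather_epi32(table, idx, 4);
}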
If you did need to extract 8x 32-bit integers from a __m256i into integer registers, you have three main choices of strategy:
- Vector store to a tmp array and scalar loads.
- ALU shuffle instructions like pextrd (_mm_extract_epi32). Use _mm256_extracti128_si256 to get the high lane into a separate __m128i.
- A mix of both strategies (e.g. store the high 128 to memory while using ALU stuff on the low half).
Depending on the surrounding code, any of these three could be optimal on Haswell.
pextrd r32, xmm, imm8 is 2 uops on Haswell, with one of them needing the shuffle unit on port 5. That's a lot of shuffle uops, so a pure ALU strategy is only going to be good if your code is bottlenecked on L1d cache throughput (not the same thing as memory bandwidth). movd r32, xmm is only 1 uop, and compilers do know to use that when compiling _mm_extract_epi32(vec, 0), but you can also write int foo = _mm_cvtsi128_si32(vec) to make it explicit and remind yourself that the bottom element can be accessed more efficiently.
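As a rough sketch, the pure ALU route could look like this (assuming vec is the __m256i holding the 8 results and <immintrin.h> is included; the variable names are just placeholders):

__m128i lo = _mm256_castsi256_si128(vec);      // low lane, no instruction needed
__m128i hi = _mm256_extracti128_si256(vec, 1); // vextracti128 to get the high lane
int e0 = _mm_cvtsi128_si32(lo);                // movd: 1 uop
int e1 = _mm_extract_epi32(lo, 1);             // pextrd: 2 uops each
int e2 = _mm_extract_epi32(lo, 2);
int e3 = _mm_extract_epi32(lo, 3);
int e4 = _mm_cvtsi128_si32(hi);
int e5 = _mm_extract_epi32(hi, 1);
int e6 = _mm_extract_epi32(hi, 2);
int e7 = _mm_extract_epi32(hi, 3);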
Store/reload has good throughput. Intel SnB-family CPUs including Haswell can run two loads per clock, and IIRC store-forwarding works from an aligned 32-byte store to any 4-byte element of it. But make sure it's an aligned store, e.g. into _Alignas(32) int tmp[8], or into a union between an __m256i and an int array. You could still store into the int array instead of the __m256i member to avoid union type-punning while still having the array aligned, but it's easiest to just use C++11 alignas or C11 _Alignas.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
...
foo2 = tmp[2];
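The union variant mentioned above might look like this (a sketch; the type and member names are arbitrary):

union vec8i {
    __m256i v;    // only here to give the union 32-byte alignment
    int     i[8];
};

union vec8i u;
_mm256_store_si256((__m256i*)u.i, vec);  // store through the int array, not the __m256i member
int foo2 = u.i[2];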
However, the problem with store/reload is latency. Even the first result won't be ready for 6 cycles after the store-data is ready.
A mixed strategy gives you the best of both worlds: ALU to extract the first 2 or 3 elements lets execution get started on whatever code uses them, hiding the store-forwarding latency of the store/reload.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
__m128i lo = _mm256_castsi256_si128(vec); // This is free, no instructions
int foo0 = _mm_cvtsi128_si32(lo);
int foo1 = _mm_extract_epi32(lo, 1);
foo2 = tmp[2];
// rest of foo3..foo7 also loaded from tmp[]
// Then use foo0..foo7
You might find that it's optimal to do the first 4 elements with pextrd, in which case you only need to store/reload the upper lane. Use vextracti128 [mem], ymm, 1:
_Alignas(16) int tmp[4];
_mm_store_si128((__m128i*)tmp, _mm256_extracti128_si256(vec, 1));
// movd / pextrd for foo0..foo3
int foo4 = tmp[0];
...
With fewer larger elements (e.g. 64-bit integers), a pure ALU strategy is more attractive. 6-cycle vector-store / integer-reload latency is longer than it would take to get all of the results with ALU ops, but store/reload could still be good if there's a lot of instruction-level parallelism and you bottleneck on ALU throughput instead of latency.
With more smaller elements (8 or 16-bit), store/reload is definitely attractive. Extracting the first 2 to 4 elements with ALU instructions is still good. And maybe even vmovd r32, xmm and then picking that apart with integer shift/mask instructions is good.
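For example, with byte elements that last idea might look something like this (a sketch, assuming vec holds packed 8-bit values in its low lane):

unsigned lo4 = (unsigned)_mm_cvtsi128_si32(_mm256_castsi256_si128(vec)); // vmovd: low 4 bytes
int b0 = lo4 & 0xFF;          // peel the bytes apart with cheap integer shifts/masks
int b1 = (lo4 >> 8)  & 0xFF;
int b2 = (lo4 >> 16) & 0xFF;
int b3 = lo4 >> 24;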
Your cycle-counting for the vector version is also bogus. The three _mm256_add_epi32 operations are independent, and Haswell can run two vpaddd instructions in parallel. (Skylake can run all three in a single cycle, each with 1 cycle latency.)
Superscalar pipelined out-of-order execution means there's a big difference between latency and throughput, and keeping track of dependency chains matters a lot. See http://agner.org/optimize/, and other links in the x86 tag wiki, for more optimization guides.