问题描述
我想通过AVX2指令来加快以下操作的速度,但是我找不到解决方法.
I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so.
我得到了uint64_t的大数组uint64_t data[100000]
和字节数组unsigned char indices[100000]
.我想输出一个数组uint64_t Out[256]
,其中第i个值是所有data[j]
的异或,如index[j]=i
.
I am given a large array uint64_t data[100000]
of uint64_t's, and an array unsigned char indices[100000]
of bytes. I want to output an array uint64_t Out[256]
where the i-th value is the xor of all data[j]
such that index[j]=i
.
我想要的是一个简单的实现:
A straightforward implementation of what I want is this:
uint64_t Out[256] = {0}; // initialize output array
for (i = 0; i < 100000 ; i++) {
Out[Indices[i]] ^= data[i];
}
我们可以使用AVX2指令更有效地实现这一点吗?
Can we implement this more efficiently with AVX2 instructions?
这就是我的代码现在的样子
EDIT : This is what my code looks like now
uint64_t Out[256][4] = {0}; // initialize output array
for (i = 0; i < 100000 ; i+=4) {
Out[Indices[i ]][0] ^= data[i];
Out[Indices[i+1]][1] ^= data[i+1];
Out[Indices[i+2]][2] ^= data[i+2];
Out[Indices[i+3]][3] ^= data[i+3];
}
推荐答案
基于对Haswell/Skylake的静态分析,我想出了一个版本,该版本每4个i
值以〜5个周期运行,而不是8个周期,由gcc编译时.大尺寸的平均值,不包括组合Out[]
的多个副本的时间,并假定索引的随机分布不会导致任何存储/重载dep链运行足够长的时间.
Based on static analysis for Haswell/Skylake, I came up with a version that runs in ~5 cycles per 4 i
values, instead of 8 cycles, when compiled by gcc. Average for large sizes, not including the time to combine multiple copies of Out[]
, and assuming a random distribution of Indices that doesn't lead to any store/reload dep chains running for long enough to matter.
如果您关心Ryzen或Excavator(其他2种主流AVX2微结构),则为IDK.
IDK if you care about Ryzen or Excavator (the other 2 mainstream AVX2 microachitectures).
我没有进行任何仔细的分析,但是 IACA 对于HSW/SKL是错误的,并且认为某些指令实际上确实是微熔丝的(在具有perf计数器的i7-6700k上进行了测试),因此它认为前端瓶颈更加严重比实际情况要好.例如movhps
加载+合并微熔断器,但是IACA认为它甚至没有简单的寻址模式.
I haven't done a careful analysis by hand, but IACA is wrong for HSW/SKL and thinks some instructions don't micro-fuse when in fact they do (tested on an i7-6700k with perf counters), so it thinks the front-end bottleneck is more severe than it really is. e.g. movhps
load+merge micro-fuses, but IACA thinks it doesn't even with simple addressing modes.
我们应该忽略任何缓存丢失,因为uint64_t Out[4][256]
仅为8kiB.因此,在最近的CPU上,我们的缓存占用空间仅为L1d大小的1/4,即使在两个逻辑线程之间共享L1d的超线程共享下,我们的缓存占用也应该足够好.循环遍历data[]
和Indices[]
应该会很好地预取,并且希望不会过多地退出Out[]
.因此,静态分析很有可能具有一定的准确性,并且比仔细进行微基准测试要快,而且更重要的是,它可以准确地告诉您瓶颈所在.
We should have negligible any cache misses, because uint64_t Out[4][256]
is only 8kiB. So our cache footprint is only 1/4 of L1d size on most recent CPUs, and should be mostly fine even with hyperthreading sharing L1d between two logical threads. Looping over data[]
and Indices[]
should prefetch well, and hopefully doesn't evict Out[]
much. Thus, static analysis has a good chance of being somewhat accurate, and it's quicker than careful micro-benchmarking and more importantly tells you exactly what the bottlenecks are.
但是,当然,我们严重依赖乱序执行,不完善的计划安排或其他意外瓶颈很容易发生.不过,如果我没有得到报酬,我实际上并不想进行微基准测试.
But of course we're relying heavily on out-of-order execution and imperfect scheduling or other unexpected bottlenecks could easily happen. I didn't feel like actually microbenchmarking if I'm not getting paid, though.
这基本上是一个直方图问题.通常使用使用多个表并在最后合并的直方图优化. SIMD XOR对于末端组合非常有用(只要您使用Out[4][256]
而不是Out[256][4]
.后者也需要通过按8*4
而不是8
进行缩放来使索引变慢(这可以可以在缩放索引寻址模式下使用单个LEA完成)).
This is basically a histogram problem. The usual histogram optimization of using multiple tables and combining at the end applies. SIMD XOR is useful for the combine-at-the-end (as long as you use Out[4][256]
, not Out[256][4]
. The latter also makes the indexing slower by requiring scaling by 8*4
instead of 8
(which can be done with a single LEA in a scaled-index addressing mode)).
但是,与普通的直方图不同,您是对内存中的某些数据进行XOR运算,而不是对常量1进行加法运算.因此,除了立即数1
之外,代码还必须将data[i]
加载到寄存器中作为xor
. (或加载,然后xor reg, data[i]
/存储).这比直方图还要多.
But unlike a normal histogram, you're XORing in some data from memory instead of ADDing a constant 1. So instead of an immediate 1
, the code has to load data[i]
into a register as a source for xor
. (Or load, then xor reg, data[i]
/ store). This is even more total memory operations than a histogram.
我们从手动"收集/分散到SIMD向量(使用movq
/movhps
加载/存储)领先,从而使我们可以将SIMD用于data[i]
加载和XOR.这样可以减少加载操作的总数,从而降低加载端口压力,而无需花费额外的前端带宽.
We come out ahead from "manual" gather/scatter into SIMD vectors (using movq
/ movhps
loads/stores), allowing us to use SIMD for the data[i]
load and XOR. This reduces the total number of load operations, and thus reduces load-port pressure without costing extra front-end bandwidth.
手动收集到256位向量中可能不值得进行额外的改组(额外的vinserti128/vextracti128,以便我们可以将2个内存源vpxor
组合为一个256位之一). 128位向量应该很好.前端吞吐量也是一个主要问题,因为(在Intel SnB系列CPU上)您要避免存储的索引寻址模式. gcc使用lea
指令来计算寄存器中的地址,而不是使用索引加载/存储. clang/带有-march=skylake
的LLVM决定不这样做,因为在端口2/端口3上出现环路瓶颈,并且花费额外的ALU指令以允许存储地址指令使用端口7是成功的,所以这是一个错误的决定.但是,如果您没有在p23上遇到瓶颈,那么花额外的钱来避免建立索引存储就不好了. (并且在可以保持微融合的情况下,绝对不是为了避免编入索引负载;愚蠢的gcc).也许gcc和LLVM的寻址模式成本模型不是很准确,或者它们没有足够详细地对管道建模,以致无法确定前端与特定端口上的环路瓶颈何时出现.
Manual gather into 256-bit vectors probably is probably not worth the extra shuffling (an extra vinserti128 / vextracti128 just so we can combine 2 memory-source vpxor
s into one 256-bit one). 128-bit vectors should be good. Front-end throughput is also a major issue, because (on Intel SnB-family CPUs) you want to avoid indexed addressing modes for the stores. gcc uses lea
instructions to calculate addresses in registers, instead of using indexed loads/stores. clang / LLVM with -march=skylake
decides not to, which is a bad decision in this case because the loop bottlenecks on port 2 / port 3, and spending extra ALU uops to allow stores-address uops to use port 7 is a win. But if you're not bottlenecked on p23, spending extra uops to avoid indexed stores is not good. (And in cases where the can stay micro-fused, definitely not just to avoid indexed loads; silly gcc). Maybe gcc and LLVM's addressing-mode cost models aren't very accurate, or they don't model the pipeline in enough detail to figure out when a loop bottlenecks on the front-end vs. a specific port.
选择寻址模式和其他asm代码生成选项对于在SnB系列上实现最佳性能至关重要.但是用C编写无法控制它;除非您可以调整源代码以做出其他选择,否则您通常会受到编译器的摆布.例如gcc与clang在这里有很大的不同.
Choice of addressing-modes and other asm code-gen choices are critical for this to perform optimally on SnB-family. But writing in C gives you no control over that; you're mostly at the mercy of the compiler, unless you can tweak the source to get it to make a different choice. e.g. gcc vs. clang makes a significant difference here.
在SnB系列中,movhps
负载需要端口5进行混洗/混合(尽管它确实将微熔丝装入一个uop中),但是movhps
存储是没有ALU uop的纯存储.因此,它在那里达到了收支平衡,让我们对两个数据元素使用一个SIMD加载/XOR.
On SnB-family, a movhps
load needs port 5 for the shuffle/blend (although it does micro-fuse into one uop), but a movhps
store is a pure store with no ALU uop. So it's break-even there, and lets us use one SIMD load / XOR for two data elements.
对于AVX,ALU uops允许使用未对齐的内存源操作数,因此我们不需要为data[]
要求对齐.但是Intel HSW/SKL可以使索引寻址模式与pxor
微融合,但不能与vpxor
微融合.因此,在未启用AVX的情况下进行编译可以更好,使编译器可以使用索引寻址模式,而不必增加单独的指针. (或者,如果编译器不知道并且仍使用索引寻址模式,则可以使其更快.)TL:DR:可能最好要求16字节对齐的data[]
并在禁用AVX的情况下编译该功能,强>以获得更好的宏融合. (但是然后,我们错过了在末尾组合Out
切片的256位SIMD,除非我们将其放在使用AVX或AVX2编译的不同函数中)
With AVX, unaligned memory source operands are allowed for ALU uops, so we don't need to require alignment for data[]
. But Intel HSW/SKL can keep an indexed addressing mode micro-fused with pxor
but not vpxor
. So compiling without AVX enabled can be better, allowing the compiler to use an indexed addressing mode instead of incrementing a separate pointer. (Or making it faster if the compiler doesn't know this and uses an indexed addressing mode anyway.) TL:DR: probably best to require 16-byte aligned data[]
and compile that function with AVX disabled, for better macro-fusion. (But then we miss out on 256-bit SIMD for combining the Out
slices at the end, unless we put that in a different function compiled with AVX or AVX2)
避免不对齐的负载也将避免任何高速缓存行拆分,这不会花费额外的成本,但是我们很可能接近L1d吞吐量限制的瓶颈,而不仅仅是负载/存储执行单元吞吐量限制.
Avoiding unaligned loads will avoid any cache-line splits, too, which doesn't cost extra uops but we're probably close to bottlenecking on L1d throughput limits, not just load/store execution unit throughput limits.
我还研究了一次加载4个索引并按照ALU指令进行拆包.例如用memcpy
进入struct { uint8_t idx[4]; } idx;
.但是gcc生成了许多浪费的指令来解压缩该指令.糟糕的x86没有很好的位域指令,例如ARM ubfx
或 PowerPC rlwinm
(这可能会使结果免费左移,因此,如果x86拥有该位,则静态可以在非PIC代码中使用base + disp32寻址模式.)
I also looked at loading 4 indices at once and unpacking with ALU instructions. e.g. with memcpy
into struct { uint8_t idx[4]; } idx;
. But gcc generates multiple wasted instructions for unpacking that. Too bad x86 doesn't have great bitfield instructions like ARM ubfx
or especially PowerPC rlwinm
(which could leave the result left-shifted for free, so if x86 had that, a static Out
could have used a base+disp32 addressing mode in non-PIC code.)
如果我们使用标量XOR,则用AL/AH中的shift/movzx解压缩双字是一个胜利,但是当我们在data[]
上使用SIMD并在允许存储地址oups在端口7上运行的指令,这使我们成为前端瓶颈,而不是端口2/3瓶颈,因此根据静态分析,从内存中使用4x movzx
负载看起来最好.如果您花时间手动编辑组件,则值得对两种方法进行基准测试. (由gcc生成的带有额外uos的asm太糟糕了,包括在右移24之后完全冗余的movzx
,而高位已经为零.)
Unpacking a dword with shift / movzx from AL/AH is a win if we're using scalar XOR, but it looks like it's not when we're using SIMD for data[]
and spending front-end throughput on lea
instructions to allow store-address uops to run on port 7. That makes us front-end bottlenecked instead of port2/3 bottlenecked, so using 4x movzx
loads from memory looks best according to static analysis. Would be worth benchmarking both ways if you take the time to hand-edit the asm. (The gcc-generated asm with extra uops is just bad, including a completely redundant movzx
after right-shifting by 24, leaving the upper bits already zero.)
(参见它时,与标量版本一起):
(See it on the Godbolt compiler explorer, along with a scalar version):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <stdalign.h>
#ifdef IACA_MARKS
#include "/opt/iaca-3.0/iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif
void hist_gatherscatter(unsigned idx0, unsigned idx1,
uint64_t Out0[256], uint64_t Out1[256],
__m128i vdata) {
// gather load from Out[0][?] and Out[1][?] with movq / movhps
__m128i hist = _mm_loadl_epi64((__m128i*)&Out0[idx0]);
hist = _mm_castps_si128( // movhps into the high half
_mm_loadh_pi(_mm_castsi128_ps(hist), (__m64*)&Out1[idx1]));
// xorps could bottleneck on port5.
// Actually probably not, using __m128 the whole time would be simpler and maybe not confuse clang
hist = _mm_xor_si128(hist, vdata);
// scatter store with movq / movhps
_mm_storel_epi64((__m128i*)&Out0[idx0], hist);
_mm_storeh_pi((__m64*)&Out1[idx1], _mm_castsi128_ps(hist));
}
void ext(uint64_t*);
void xor_histo_avx(uint8_t *Indices, const uint64_t *data, size_t len)
{
alignas(32) uint64_t Out[4][256] = {{0}};
// optional: peel the first iteration and optimize away loading the old known-zero values from Out[0..3][Indices[0..3]].
if (len<3) // not shown: cleanup for last up-to-3 elements.
return;
for (size_t i = 0 ; i<len ; i+=4) {
IACA_START
// attempt to hand-hold compiler into a dword load + shifts to extract indices
// to reduce load-port pressure
struct { uint8_t idx[4]; } idx;
#if 0
memcpy(&idx, Indices+i, sizeof(idx)); // safe with strict-aliasing and possibly-unaligned
//gcc makes stupid asm for this, same as for memcpy into a struct,
// using a dword load into EAX (good),
// then AL/AH for the first 2 (good)
// but then redundant mov and movzx instructions for the high 2
// clang turns it into 4 loads
/*
//Attempt to hand-hold gcc into less-stupid asm
//doesn't work: same asm as the struct
uint32_t tmp;
memcpy(&tmp, Indices+i, sizeof(tmp)); // mov eax,[mem]
idx.idx[0] = tmp; //movzx reg, AL
idx.idx[1] = tmp>>8; //movzx reg, AH
tmp >>= 16; //shr eax, 16
idx.idx[2] = tmp; //movzx reg, AL
idx.idx[3] = tmp>>8; //movzx reg, AH
*/
#else
// compiles to separate loads with gcc and clang
idx.idx[0] = Indices[i+0];
idx.idx[1] = Indices[i+1];
idx.idx[2] = Indices[i+2];
idx.idx[3] = Indices[i+3];
#endif
__m128i vd = _mm_load_si128((const __m128i*)&data[i]);
hist_gatherscatter(idx.idx[0], idx.idx[1], Out[0], Out[1], vd);
vd = _mm_load_si128((const __m128i*)&data[i+2]);
hist_gatherscatter(idx.idx[2], idx.idx[3], Out[2], Out[3], vd);
}
IACA_END
// hand-hold compilers into a pointer-increment loop
// to avoid indexed addressing modes. (4/5 speedup on HSW/SKL if all the stores use port7)
__m256i *outp = (__m256i*)&Out[0];
__m256i *endp = (__m256i*)&Out[3][256];
for (; outp < endp ; outp++) {
outp[0] ^= outp[256/4*1];
outp[0] ^= outp[256/4*2];
outp[0] ^= outp[256/4*3];
}
// This part compiles horribly with -mno-avx, but does compile
// because I used GNU C native vector operators on __m256i instead of intrinsics.
/*
for (int i=0 ; i<256 ; i+=4) {
// use loadu / storeu if Out isn't aligned
__m256i out0 = _mm256_load_si256(&Out[0][i]);
__m256i out1 = _mm256_load_si256(&Out[1][i]);
__m256i out2 = _mm256_load_si256(&Out[2][i]);
__m256i out3 = _mm256_load_si256(&Out[3][i]);
out0 = _mm256_xor_si256(out0, out1);
out0 = _mm256_xor_si256(out0, out2);
out0 = _mm256_xor_si256(out0, out3);
_mm256_store_si256(&Out[0][i], out0);
}
*/
//ext(Out[0]); // prevent optimizing away the work
asm("" :: "r"(Out) : "memory");
}
使用gcc7.3 -std=gnu11 -DIACA_MARKS -O3 -march=skylake -mno-avx
进行了编译,并使用IACA-3.0进行了分析:
Compiled with gcc7.3 -std=gnu11 -DIACA_MARKS -O3 -march=skylake -mno-avx
, and analyzed with IACA-3.0:
$ /opt/iaca-3.0/iaca xor-histo.iaca.o Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File - xor-histo.iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 5.79 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 22 (this is fused-domain uops. It's actually 20, so a 5 cycle front-end bottleneck)
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 2.0 0.0 | 3.0 | 5.5 5.1 | 5.5 4.9 | 4.0 | 3.0 | 2.0 | 3.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movzx r8d, byte ptr [rdi]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movzx edx, byte ptr [rdi+0x2]
| 1 | | | | | | | 1.0 | | add rdi, 0x4
| 1 | | | | | | | 1.0 | | add rsi, 0x20
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movzx eax, byte ptr [rdi-0x1]
| 1 | | 1.0 | | | | | | | lea r12, ptr [rcx+r8*8]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movzx r8d, byte ptr [rdi-0x3]
| 1 | | 1.0 | | | | | | | lea rdx, ptr [r10+rdx*8]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movq xmm0, qword ptr [r12]
| 1 | | | | | | 1.0 | | | lea rax, ptr [r9+rax*8]
| 1 | | 1.0 | | | | | | | lea r8, ptr [r11+r8*8]
| 2 | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | movhps xmm0, qword ptr [r8] # Wrong, 1 micro-fused uop on SKL
| 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | pxor xmm0, xmmword ptr [rsi-0x20]
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | movq qword ptr [r12], xmm0 # can run on port 7, IDK why IACA chooses not to model it there
| 2^ | | | | | 1.0 | | | 1.0 | movhps qword ptr [r8], xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movq xmm0, qword ptr [rdx]
| 2 | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | movhps xmm0, qword ptr [rax] # Wrong, 1 micro-fused uop on SKL
| 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | pxor xmm0, xmmword ptr [rsi-0x10]
| 2^ | | | | | 1.0 | | | 1.0 | movq qword ptr [rdx], xmm0
| 2^ | | | | | 1.0 | | | 1.0 | movhps qword ptr [rax], xmm0
| 1* | | | | | | | | | cmp rbx, rdi
| 0*F | | | | | | | | | jnz 0xffffffffffffffa0
Total Num Of Uops: 29 (This is unfused-domain, and a weird thing to total up).
Godbolt上的
gcc8.1对pxor
使用缩放索引寻址模式,对索引和data[]
使用相同的计数器,从而节省了add
.
gcc8.1 on Godbolt uses a scaled-index addressing mode for pxor
, using the same counter for Indices and data[]
, so that saves an add
.
clang不使用LEA,并且瓶颈每7个循环出现4个i
,因为没有任何存储uops可以在端口7上运行.
clang doesn't use LEA, and bottlenecks at 4 i
s per 7 cycles, because none of the store uops can run on port 7.
标量版本(仍使用4个Out[4][256]
切片):
The scalar version (still using 4 slices of Out[4][256]
):
$ iaca.sh -mark 2 xor-histo.iaca.o
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - xor-histo.iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 2
*******************************************************************
Throughput Analysis Report
--------------------------
Block Throughput: 7.24 Cycles Throughput Bottleneck: FrontEnd
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 3.0 0.0 | 3.0 | 6.2 4.5 | 6.8 4.5 | 4.0 | 3.0 | 3.0 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov eax, dword ptr [rdi]
| 1 | 0.4 | 0.5 | | | | 0.1 | | | | add rdi, 0x4
| 1 | | 0.7 | | | | 0.3 | | | | add rsi, 0x20
| 1* | | | | | | | | | | movzx r9d, al
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov rdx, qword ptr [rbp+r9*8-0x2040]
| 2^ | | 0.3 | 0.5 0.5 | 0.5 0.5 | | 0.3 | 0.4 | | | xor rdx, qword ptr [rsi-0x20]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | | mov qword ptr [rbp+r9*8-0x2040], rdx # wrong, HSW/SKL can keep indexed stores fused
| 1* | | | | | | | | | | movzx edx, ah
| 1 | | | | | | 0.4 | 0.6 | | | add rdx, 0x100
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov r9, qword ptr [rbp+rdx*8-0x2040]
| 2^ | 0.6 | 0.2 | 0.5 0.5 | 0.5 0.5 | | 0.2 | 0.1 | | | xor r9, qword ptr [rsi-0x18]
| 2 | | | 0.2 | 0.8 | 1.0 | | | | | mov qword ptr [rbp+rdx*8-0x2040], r9 # wrong, HSW/SKL can keep indexed stores fused
| 1* | | | | | | | | | | mov edx, eax # gcc code-gen isn't great, but not as bad as in the SIMD loop. No extra movzx, but not taking advantage of AL/AH
| 1 | 0.4 | | | | | | 0.6 | | | shr eax, 0x18
| 1 | 0.8 | | | | | | 0.2 | | | shr edx, 0x10
| 1 | | 0.6 | | | | 0.3 | | | | add rax, 0x300
| 1* | | | | | | | | | | movzx edx, dl
| 1 | 0.2 | 0.1 | | | | 0.5 | 0.2 | | | add rdx, 0x200
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov r9, qword ptr [rbp+rdx*8-0x2040]
| 2^ | | 0.6 | 0.5 0.5 | 0.5 0.5 | | 0.3 | 0.1 | | | xor r9, qword ptr [rsi-0x10]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | | mov qword ptr [rbp+rdx*8-0x2040], r9 # wrong, HSW/SKL can keep indexed stores fused
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov rdx, qword ptr [rbp+rax*8-0x2040]
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 0.6 | 0.4 | | | xor rdx, qword ptr [rsi-0x8]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | | mov qword ptr [rbp+rax*8-0x2040], rdx # wrong, HSW/SKL can keep indexed stores fused
| 1 | 0.6 | | | | | | 0.4 | | | cmp r8, rdi
| 0F | | | | | | | | | | jnz 0xffffffffffffff75
Total Num Of Uops: 33
该循环比IACA计算的结果短4个融合域,因为它不知道只有SnB/IvB可以对未分层索引的存储区进行存储. HSW/SKL不需要.但是,此类存储仍然无法使用端口7,因此对于4个元素,这不会比6.5个周期更好.
The loop is 4 fused-domain uops shorter than what IACA counts, because it doesn't know that only SnB/IvB un-laminate indexed stores. HSW/SKL don't. Such stores still can't use port 7, though, so this won't get any better than ~6.5 cycles for 4 elements.
(还有BTW,通过对天真[i]的天真处理,分别使用movzx加载每个循环,您将获得4个元素的8个周期,使端口2和3饱和.即使gcc不会生成用于解包的吞吐量最佳代码在结构上,通过减轻一些加载端口压力,可以使4字节加载+拆包成为赢家.)
(And BTW, with naive handling of Indices[i], loading each one separately with movzx, you get 8 cycles for 4 elements, saturating ports 2 and 3. Even though gcc doesn't generate throughput-optimal code for unpacking the struct, the 4-byte load + unpack should be a net win by relieving some load-port pressure.)
清理循环:
AVX2确实在这里闪耀:我们遍历直方图的最低切片,而在其他切片中进行XOR.此循环是在Skylake上有4个负载的8个前端uops,并且应每2个时钟以1个iter运行:
AVX2 really shines here: we loop over the lowest slice of the histogram, and XOR in the other slices. This loop is 8 front-end uops with 4 loads on Skylake, and should run at 1 iter per 2 clocks:
.L7:
vmovdqa ymm2, YMMWORD PTR [rax+4096]
vpxor ymm0, ymm2, YMMWORD PTR [rax+6144]
vmovdqa ymm3, YMMWORD PTR [rax]
vpxor ymm1, ymm3, YMMWORD PTR [rax+2048]
vpxor ymm0, ymm0, ymm1
vmovdqa YMMWORD PTR [rax], ymm0
add rax, 32
cmp rax, rdx
jne .L7
我试图通过在一个链中执行XOR来进一步减少uop计数,但是gcc坚持要执行两次vmovdqa
加载,并且必须执行一次vpxor
而没有内存操作数. (OoO执行人员会隐藏VPXOR的这个微小链/树的延迟,所以没关系.)
I tried to reduce the uop count further by doing the XORs in one chain, but gcc insists on doing two vmovdqa
loads and having to do one vpxor
without a memory operand. (OoO exec will hide the latency of this tiny chain / tree of VPXOR so it doesn't matter.)
不,您将使用聚集来获取旧值,然后使用SIMD XOR,然后将更新后的元素散布到它们来自的位置.
No, you'd use a gather to get the old values, then SIMD XOR, then scatter the updated elements back to the locations they came from.
为避免冲突,您可能需要out[8][256]
,以便每个向量元素都可以使用不同的表. (否则,如果Indices[i+0]
和Indices[i+4]
相等,则会出现问题,因为散布存储将只存储具有该索引的最高矢量元素.
To avoid conflicts, you might want out[8][256]
so every vector element can use a different table. (Otherwise you have a problem if Indices[i+0]
and Indices[i+4]
were equal, because the scatter store would just store the highest vector element with that index.
散点图/聚集指令只需要一个基址寄存器,但是您可以在执行vpmovzxbq
零扩展加载后简单地添加_mm256_setr_epi64(0, 256, 256*2, ...);
.
Scatter/gather instructions need a single base register, but you can simply add _mm256_setr_epi64(0, 256, 256*2, ...);
after doing a vpmovzxbq
zero-extending load.
注释
我使用IACA2.3进行标量分析,因为当一个文件中有多个标记时,IACA3.0似乎已经删除了-mark
选项来选择要分析的循环.在这种情况下,IACA3.0并未解决IACA2.3对SKL管道错误的任何方式.
I used IACA2.3 for the scalar analysis because IACA3.0 seems to have removed the -mark
option to choose which loop to analyze when you have multiple marks in one file. IACA3.0 didn't fix any of the ways that IACA2.3 is wrong about SKL's pipeline in this case.
这篇关于使用AVX2指令选择性地对列表元素进行异或的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!