"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count.

Constant-generation is by its very nature the start of a fresh dependency chain, so it's unusual for latency to matter. It's also unusual to generate constants inside a loop, so throughput and execution-port demands are also mostly irrelevant.

Generating constants instead of loading them takes more instructions (except for all-zero or all-ones), so it does consume precious uop-cache space. This can be an even more limited resource than data cache.

Agner Fog's excellent Optimizing Assembly guide covers this in Section 13.4. Table 13.10 has sequences for generating vectors where every element is 0, 1, 2, 3, 4, -1, or -2, with element sizes from 8 to 64 bits. Table 13.11 has sequences for generating some floating-point values (0.0, 0.5, 1.0, 1.5, 2.0, -2.0, and bitmasks for the sign bit). Agner Fog's sequences only use SSE2, either by design or because the guide hasn't been updated for a while.

What other constants can be generated with short, non-obvious sequences of instructions? (Further extensions with different shift counts are obvious and not "interesting".) Are there better sequences for generating the constants Agner Fog does list?

How to move 128-bit immediates to XMM registers illustrates some ways to put an arbitrary 128b constant into the instruction stream, but that's usually not sensible (it doesn't save any space, and takes up lots of uop-cache space).

Solution

All-zero: pxor xmm0,xmm0 (or xorps xmm0,xmm0, one instruction byte shorter). There isn't much difference on modern CPUs, but on Nehalem (before xor-zero elimination), the xorps uop could only run on port 5. I think that's why compilers favour pxor-zeroing even for registers that will be used with FP instructions.

All-ones: pcmpeqw xmm0,xmm0. This is the usual starting point for generating other constants, because (like pxor) it breaks the dependency on the previous value of the register (except on old CPUs like K10 and pre-Core2 P6). There's no advantage to the W version over the byte or dword element-size versions of pcmpeq on any CPU in Agner Fog's instruction tables, but pcmpeqQ takes an extra byte, is slower on Silvermont, and requires SSE4.1.
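From C, these two starting points map directly onto intrinsics. The following is only a minimal sketch (the function names are made up for illustration): compilers typically recognize _mm_setzero_si128 and _mm_set1_epi32(-1) and emit the xor-zeroing and pcmpeq idioms instead of loading a 16-byte constant from memory, but that is up to the compiler.

```c
#include <immintrin.h>

// Sketch: the two "free" starting-point constants, as intrinsics.
// zeros() normally compiles to pxor/xorps; all_ones() to pcmpeqd same,same.
__m128i zeros(void)    { return _mm_setzero_si128(); }  // all-zero
__m128i all_ones(void) { return _mm_set1_epi32(-1); }   // all-ones, i.e. set1(-1)
```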
SO doesn't really have table formatting, so I'm just going to list additions to Agner Fog's table 13.10, rather than an improved version. Sorry. Maybe if this answer becomes popular, I'll use an ascii-art table generator, but hopefully improvements will be rolled into future versions of the guide.

The main difficulty is 8-bit vectors, because there's no psllb. Agner Fog's table generates vectors of 16-bit elements and uses packuswb to work around this. For example, pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,1 / packuswb xmm0,xmm0 generates a vector where every byte is 2. (This pattern of shifts, with different counts, is the main way to produce most constants for wider vectors.) There is a better way:

paddb xmm0,xmm0 (SSE2) works as a left shift by one with byte granularity, so a vector of -2 bytes can be generated with only two instructions (pcmpeqw / paddb). paddw/d/q as a left-shift-by-one for other element sizes saves one byte of machine code compared to shifts, and can generally run on more ports than a shift-imm.

pabsb xmm0,xmm0 (SSSE3) turns a vector of all-ones (-1) into a vector of 1 bytes, and is non-destructive, so you still have the set1(-1) vector. (You sometimes don't need set1(1): you can add 1 to every element by subtracting -1 with psubb instead.)

We can generate 2 bytes with pcmpeqw / paddb / pabsb. (The order of the add vs. the abs doesn't matter.) pabs doesn't need an imm8, but it only saves code bytes vs. right-shifting for other element widths when both require a 3-byte VEX prefix, which only happens when the source register is xmm8-15. (vpabsb/w/d always requires a 3-byte VEX prefix for VEX.128.66.0F38.WIG, but vpsrlw dest,src,imm can otherwise use a 2-byte VEX prefix for its VEX.NDD.128.66.0F.WIG.)

We can actually save instructions in generating 4 bytes, too: pcmpeqw / pabsb / psllw xmm0, 2. All the bits that are shifted across byte boundaries by the word-shift are zero, thanks to pabsb. Obviously other shift counts can put the single set bit at other locations, including the sign bit to generate a vector of -128 (0x80) bytes. Note that pabsb is non-destructive (the destination operand is write-only, and doesn't need to be the same as the source to get the desired behaviour). You can keep the all-ones around as a constant, or as the start of generating another constant, or as a source operand for psubb (to increment by one).

A vector of 0x80 bytes can also (see the previous paragraph) be generated from anything that saturates to -128, using packsswb. E.g. if you already have a vector of 0xFF00 for something else, just copy it and use packsswb. Constants loaded from memory that happen to saturate correctly are potential targets for this.

A vector of 0x7f bytes can be generated with pcmpeqw / psrlw xmm0, 9 / packuswb xmm0,xmm0. I'm counting this as "non-obvious" because the mostly-set nature didn't make me think of just generating it as a value in each word and doing the usual packuswb.

pavgb (SSE2) against a zeroed register can right-shift by one, but only if the value is even. (It does unsigned dst = (dst+src+1)>>1 for rounding, with 9-bit internal precision for the temporary.) This doesn't seem to be useful for constant generation, though, because 0xff is odd: pxor xmm1,xmm1 / pcmpeqw xmm0,xmm0 / paddb xmm0,xmm0 / pavgb xmm0,xmm1 produces 0x7f bytes with one more insn than shift/pack. If a zeroed register is already needed for something else, though, paddb / pavgb does save one instruction byte.
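If you want to play with a few of these sequences from C, here is a rough intrinsics translation. This is only a sketch (the helper names are invented, and SSSE3 is assumed for _mm_abs_epi8); whether a compiler actually emits the short register-only sequences or just loads a 16-byte constant from memory is entirely up to the compiler.

```c
#include <immintrin.h>

// Sketch of a few of the byte-constant tricks as intrinsics (SSSE3 for pabsb).
static inline __m128i all_ones128(void) { return _mm_set1_epi32(-1); }

__m128i bytes_minus2(void)            // like pcmpeqd / paddb
{
    __m128i m1 = all_ones128();
    return _mm_add_epi8(m1, m1);      // -1 + -1 = -2 in every byte
}

__m128i bytes_2(void)                 // like pcmpeqd / paddb / pabsb
{
    return _mm_abs_epi8(bytes_minus2());
}

__m128i bytes_4(void)                 // like pcmpeqd / pabsb / psllw xmm, 2
{
    __m128i one = _mm_abs_epi8(all_ones128());   // 0x01 in every byte
    return _mm_slli_epi16(one, 2);    // word shift is safe: crossed bits are 0
}
```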
I have tested these sequences. The easiest way is to throw them in a .asm, assemble/link, and run gdb on it: layout asm, display /x $xmm0.v16_int8 to dump that register after every step, and single-step the instructions (ni or si). In layout reg mode, you can do tui reg vec to switch to a display of vector regs, but it's nearly useless because you can't select which interpretation to display (you always get all of them, can't hscroll, and the columns don't line up between registers). It's excellent for integer regs/flags, though.

Note that using these with intrinsics can be tricky. Compilers don't like to operate on uninitialized variables, so you should use _mm_undefined_si128() to tell the compiler that's what you meant. Or perhaps using _mm_set1_epi32(-1) will get your compiler to emit a pcmpeqd same,same. Without this, some compilers will xor-zero uninitialized vector variables before use, or even (MSVC) load uninitialized memory from the stack.

Many constants can be stored more compactly in memory by taking advantage of SSE4.1's pmovzx or pmovsx for zero- or sign-extension on the fly. For example, a 128b vector of {1, 2, 3, 4} as 32-bit elements could be generated with a pmovzx load from a 32-bit memory location. Memory operands can micro-fuse with pmovzx, so it doesn't take any extra fused-domain uops. It does prevent using the constant directly as a memory operand, though.

C/C++ intrinsics support for using pmovz/sx as a load is terrible: there's _mm_cvtepu8_epi32 (__m128i a), but no version that takes a uint32_t * pointer operand. You can hack around it, but it's ugly and compiler optimization failure is a problem. See the linked question for details and links to the gcc bug reports.
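One common way to hack around the missing pointer-taking intrinsic is to do a narrow scalar load yourself and feed it to _mm_cvtepu8_epi32. The sketch below is only an illustration (the helper name is hypothetical): ideally the compiler folds it into a single pmovzxbd with a memory operand, but as noted above, that folding is not guaranteed.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: zero-extend 4 bytes from memory into four 32-bit
// elements (ideally a single pmovzxbd xmm, dword [p]).
static inline __m128i load4_zx_epi32(const uint8_t *p)
{
    uint32_t bits;
    memcpy(&bits, p, sizeof(bits));           // strict-aliasing-safe 4-byte load
    return _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)bits));
}

static const uint8_t k1234[4] = { 1, 2, 3, 4 };

__m128i make_1234(void)                       // {1, 2, 3, 4} as 32-bit elements
{
    return load4_zx_epi32(k1234);
}
```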
With 256b and (not so) soon 512b constants, the savings in memory are larger. This only matters very much if multiple useful constants can share a cache line, though.

The FP equivalent of this is VCVTPH2PS xmm1, xmm2/m64, requiring the F16C (half-precision) feature flag. (There's also a store instruction that packs single to half, but no computation at half precision. It's a memory-bandwidth / cache-footprint optimization only.)

Obviously when all elements are the same (but not suitable for generating on the fly), pshufd or AVX vbroadcastps / AVX2 vpbroadcastb/w/d/q/i128 are useful. pshufd can take a memory source operand, but it has to be 128b. movddup (SSE3) does a 64-bit load, broadcast to fill a 128b register. On Intel, it doesn't need an ALU execution unit, only a load port. (Similarly, AVX v[p]broadcast loads of dword size and larger are handled in the load unit, without an ALU uop.) A short intrinsics sketch of these broadcast loads is at the end of this answer.

Broadcasts or pmovz/sx are excellent for saving executable size when you're going to load a mask into a register for repeated use in a loop. Generating multiple similar masks from one starting point can also save space, if it only takes one instruction.

See also For an SSE vector that has all the same components, generate on the fly or precompute?, which is asking more about using the set1 intrinsic, and it isn't clear whether it's asking about constants or broadcasts of variables. I also experimented some with compiler output for broadcasts.

If cache misses are a problem, take a look at your code and see if the compiler has duplicated _mm_set constants when the same function is inlined into different callers. Also watch out for constants that are used together (e.g. in functions called one after another) being scattered into different cache lines. Many scattered loads for constants are far worse than loading a lot of constants from near each other.

pmovzx and/or broadcast loads let you pack more constants into a cache line, with very low overhead for loading them into a register. The load won't be on the critical path, so even if it takes an extra uop, it can take a free execution unit at any cycle over a long window.

clang actually does a good job of this: separate set1 constants in different functions are recognized as identical, the way identical string literals can be merged. Note that clang's asm source output appears to show each function having its own copy of the constant, but the binary disassembly shows that all those RIP-relative effective addresses reference the same location. For 256b versions of the repeated functions, clang also uses vbroadcastsd to only require an 8B load, at the expense of an extra instruction in each function. (This is at -O3, so clearly the clang devs have realized that size matters for performance, not just for -Os.) IDK why it doesn't go down to a 4B constant with vbroadcastss, because that should be just as fast. Unfortunately, the vbroadcast load doesn't simply come from part of the 16B constant the other functions used. That maybe makes sense: an AVX version of something could probably only merge some of its constants with an SSE version. It's better to leave the memory pages with SSE constants completely cold, and have the AVX version keep all its constants together. Also, it's a harder pattern-matching problem to handle at assemble or link time (however that's done; I didn't read every directive to figure out which one enables the merging).

gcc 5.3 also merges constants, but doesn't use broadcast loads to compress 32B constants. Again, the 16B constant doesn't overlap with the 32B constant.
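For reference, here is roughly what those broadcast loads look like from C. This is a sketch under assumptions (the constant values are arbitrary examples, and the 256b function requires AVX to be enabled); it is not taken from the compiler-output experiments mentioned above.

```c
#include <immintrin.h>

// Example constants (arbitrary values, just for illustration).
static const float  kF = 1.5f;
static const double kD = 2.5;

__m128d dup_pd(void)   { return _mm_loaddup_pd(&kD); }       // SSE3 movddup: 64-bit load + broadcast
__m256  bcast_ps(void) { return _mm256_broadcast_ss(&kF); }  // AVX vbroadcastss ymm, [mem]
__m128  splat_ps(void) { return _mm_set1_ps(kF); }           // compiler's choice: load+shuffle,
                                                              // or vbroadcastss when AVX is enabled
```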