问题描述
如何将屏蔽寄存器的所有设置位向右移动?(到最下面的最低位置).
How can I move all set bits of mask register to right? (To the bottom, least-significant position).
例如:
__mmask16 mask = _mm512_cmpeq_epi32_mask(vload, vlimit); // mask = 1101110111011101
如果将所有设置的位右移,则会得到: 1101110111011101->0000111111111111
If we move all set bits to the right, we will get: 1101110111011101 -> 0000111111111111
如何有效地做到这一点?
How can I achieve this efficiently?
下面您可以看到我如何尝试获得相同的结果,但是效率低下:
Below you can see how I tried to get the same result, but it's inefficient:
__mmask16 mask = 56797;
// mask: 1101110111011101
__m512i vbrdcast = _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(~0));
// vbrdcast: -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1
__m512i vcompress = _mm512_maskz_compress_epi32(mask, vbrdcast);
// vcompress:-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0
__mmask16 right_packed_mask = _mm512_movepi32_mask(vcompress);
// right_packed_mask: 0000111111111111
做到这一点的最佳方法是什么?
What is the best way to do this?
推荐答案
BMI2 pext
是 v [p] compressed/q/ps/pd
.
用它在您的遮罩值上将其左包装到值的底部.
BMI2 pext
is the scalar bitwise equivalent of v[p]compressd/q/ps/pd
.
Use it on your mask value to left-pack them to the bottom of the value.
mask = _pext_u32(-1U, mask); // or _pext_u64(-1ULL, mask64) for __mmask64
// costs 3 asm instructions (kmov + pext + kmov) if you need to use the result as a mask
// not including putting -1 in a register.
__ mmask16(在GCC中也称为uint16_t)和uint32_t之间的隐式转换有效.
如果愿意,可以使用 _cvtu32_mask16
和 _cvtu32_mask16
明确显示KMOVW.
Implicit conversion between __mmask16 (aka uint16_t in GCC) and uint32_t works.
Use _cvtu32_mask16
and _cvtu32_mask16
to make the KMOVW explicit if you like.
请参见如何取消设置N个最右边的设置位有关以这种方式使用pext/pdep的更多信息.
See How to unset N right-most set bits for more about using pext/pdep in ways like this.
当前所有装有AVX-512的CPU都具有快速的BMI2 pext
(包括Xeon Phi),其性能与popcnt相同,但是如果AMD引入AVX-512个CPU.对于AMD,您可能需要(1ULL<< __ builtin_popcount(mask))-1
,因为pext/pdep在当前的AMD上非常慢.
All current CPUs with AVX-512 also have fast BMI2 pext
(including Xeon Phi), same performance as popcnt, although that may change if AMD ever introduces AVX-512 CPUs. For AMD you might want(1ULL << __builtin_popcount(mask)) - 1
because pext/pdep are very slow on current AMD.
如果要使用 vpcompressed
,请注意,源向量可以简单地是全为 _mm512_set1_epi32(-1)
;compress并不关心遮罩为零的元素,它们不必已经为零.
If you were going to use vpcompressd
, note that the source vector can simply be all-ones _mm512_set1_epi32(-1)
; compress doesn't care about elements where the mask was zero, they don't need to already be zero.
(打包的 -1
没关系;一旦使用布尔值, true之间就没有区别来自原始位掩码的code>与位于其中的常量
true
相比,您生成的费用更低,而且不依赖于输入掩码. pext
,为什么可以使用 -1U
代替 pdep
作为源数据,即 -1
或置位位没有身份;它与任何其他 -1
或设置的位相同).
(It doesn't matter which -1
s you pack; once you're working with boolean values, there's no difference between a true
that came from your original bitmask vs. a constant true
that was just sitting there which you generated more cheaply, without a dependency on your input mask. Same reasoning applies for pext
, why you can use -1U
as the source data instead of a pdep
. i.e. a -1
or set bit doesn't have an identity; it's the same as any other -1
or set bit).
因此,让我们尝试两种方式,看看asm的优缺点.
So let's try both ways and see how good/bad the asm is.
inline
__mmask16 leftpack_k(__mmask16 mask){
return _pdep_u32(-1U, mask);
}
inline
__mmask16 leftpack_comp(__mmask16 mask) {
__m512i v = _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1));
return _mm512_movepi32_mask(v);
}
查看这些文件的独立版本没有用,因为 __ mmask16
是 unsigned short
的typedef,因此在整数寄存器中传递/返回,而不是在 k
寄存器.当然,这使 pext
版本看起来非常好,但是我们想看看它如何内联到我们使用AVX-512内在函数生成和使用掩码的情况下.
Looking at stand-alone versions of these isn't useful because __mmask16
is a typedef for unsigned short
, and is thus passed/returned in integer registers, not k
registers. That makes the pext
version look very good, of course, but we want to see how it inlines into a case where we generate and use the mask with AVX-512 intrinsics.
// not a useful function, just something that compiles to asm in an obvious way
void use_leftpack_compress(void *dst, __m512i v){
__mmask16 m = _mm512_test_epi32_mask(v,v);
m = leftpack_comp(m);
_mm512_mask_storeu_epi32(dst, m, v);
}
注释 m = pack(m)
,这只是生成并使用掩码的2条简单指令.
Commenting out the m = pack(m)
, this is just a simple 2 instructions that generate and then use a mask.
use_mask_nocompress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
因此,任何额外的说明将归因于左包装(压缩)面罩.GCC和clang彼此具有相同的asm,不同之处仅在于clang避免使用 kmovw
,而总是使用 kmovd
.
So any extra instructions will be due to left-packing (compressing) the mask. GCC and clang make the same asm as each other, differing only in clang avoiding kmovw
in favour of always kmovd
. Godbolt
# GCC10.3 -O3 -march=skylake-avx512
use_leftpack_k(void*, long long __vector(8)):
vptestmd k0, zmm0, zmm0
mov eax, -1 # could be hoisted out of a loop
kmovd edx, k0
pdep eax, eax, edx
kmovw k1, eax
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
use_leftpack_compress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vpternlogd zmm2, zmm2, zmm2, 0xFF # set1(-1) could be hoisted out of a loop
vpcompressd zmm1{k1}{z}, zmm2
vpmovd2m k1, zmm1
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
所以不可吊起的部分是
-
kmov r,k
(端口0)/pext
(端口1)/kmov k,r
(端口5)= 3微码,每个执行端口一个.(包括端口1,该端口的向量ALU在飞行512位uos时关闭).kmov/kmov往返行程在SKX上有4个周期的延迟,而pext
是3个周期的延迟,总共7个周期的延迟.
kmov r,k
(port 0) /pext
(port 1) /kmov k,r
(port 5) = 3 uops, one for each execution port. (Including port 1, which has its vector ALUs shut down while 512-bit uops are in flight). The kmov/kmov round trip has 4 cycle latency on SKX, andpext
is 3 cycle latency, for a total of 7 cycle latency.
vpcompressed zmm {k} {z},z
(2 p5)/ vpmovd2m
(端口0)= 3 uops,两个用于端口5. vpmovd2m
具有 SKX上的3个周期延迟/ICL和 vpcompressed
-zeroing-into-zmm从k输入到zmm输出都有6个周期( SKX 和ICL).因此,总共9个周期的延迟时间,并且uops的端口分配更差.
vpcompressd zmm{k}{z}, z
(2 p5) / vpmovd2m
(port 0) = 3 uops, two for port 5. vpmovd2m
has 3 cycle latency on SKX / ICL, and vpcompressd
-zeroing-into-zmm has 6 cycle from the k input to the zmm output (SKX and ICL). So a total of 9 cycle latency, and worse port distribution for the uops.
此外,可吊起的部分通常会更糟( vpternlogd
更长,并且比 mov r32,imm32
竞争更少的端口),除非您的功能已经需要全1向量,但不是所有人的寄存器.
Also, the hoistable part is generally worse (vpternlogd
is longer and competes for fewer ports than mov r32, imm32
), unless your function already needs an all-ones vector for something but not an all-ones register.
结论:BMI2 pext
的方式丝毫不差,在几种方面都更好..(除非周围的代码在端口1 uops上出现大量瓶颈,如果使用512位向量,这几乎是不可能的,因为在那种情况下,它只能运行标量整数uops,例如3周期LEA,IMUL,LZCNT,当然还有简单的1周期整数,例如add/sub/and/or).
Conclusion: the BMI2 pext
way is not worse in any way, and better in several. (Unless surrounding code heavily bottlenecked on port 1 uops, which is very unlikely if using 512-bit vectors because in that case it can only be running scalar integer uops like 3-cycle LEA, IMUL, LZCNT, and of course simple 1-cycle integer stuff like add/sub/and/or).
这篇关于AVX512-如何将所有设置的位右移?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!