在没有AVX2的情况下

在没有AVX2的情况下

本文介绍了在没有AVX2的情况下,如何使用字节中的位设置ymm寄存器中的双字? (vmovmskps的倒数)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要实现的目标是基于一个字节中的每个位,并将其设置为ymm寄存器(或存储位置)中每个dword中的所有位

例如

al = 0110 0001

ymm0 = 0x00000000 FFFFFFFF FFFFFFFF 00000000 00000000 00000000 00000000 FFFFFFFF

vmovmskps eax, ymm0/_mm256_movemask_ps的倒数,将位图转换为矢量掩码.

我在想有一些sse/avx指令可以相对简单地完成此操作,但是我还无法解决.最好是沙桥兼容,所以没有avx2.

解决方案

如果AVX2可用,请参见 ,而不是使用整数的更有效版本SIMD.您可以使用该想法,并将位图分成两个4位的块,以与LUT一起使用.这可能表现得相当不错:vinsertf128在Sandybridge上每时钟吞吐量为1,而在Haswell/Skylake上为每0.5c吞吐量.

使用AVX1的SIMD整数解决方案可以对高/低矢量一半进行两次相同的工作(2x广播位图,2x屏蔽位,2x vpcmpeqd xmm),然后是vinsertf128,但这有点糟./p>

您可能会考虑使用vpbroadcastd ymm0, mem/vpand ymm0, mask/vpcmpeqd dst, ymm0, mask将AVX2版本与仅AVX1版本分开,因为这非常有效,特别是如果您要从内存中加载位图,并且可以读取该位图的整个dword. (dword或qword的广播负载不需要ALU随机播放,因此值得一读). maskset_epi32(1<<7, 1<<6, 1<<5< ..., 1<<0),可以用vpmovzxbd ymm, qword [constant]加载,因此它仅占用8个字节的数据存储空间即可存储8个元素.


内部版本,请参见下面的说明和asm版本.编译有关我们对 -march=sandybridge

上,在Godbolt上strong>

#include <immintrin.h>
// AVX2 can be significantly more efficient, doing this with integer SIMD
// Especially for the case where the bitmap is in an integer register, not memory
// It's fine if `bitmap` contains high garbage; make sure your C compiler broadcasts from a dword in memory if possible instead of integer load with zero extension.
// e.g. __m256 _mm256_broadcast_ss(float *a);  or memcpy to unsigned.
// Store/reload is not a bad strategy vs. movd + 2 shuffles so maybe just do it even if the value might be in a register; it will force some compilers to store/broadcast-load.  But it might not be type-punning safe  even though it's an intrinsic.

// Low bit -> element 0, etc.
__m256 inverse_movemask_ps_avx1(unsigned bitmap)
{
    // if you know DAZ is off: don't OR, just AND/CMPEQ with subnormal bit patterns
    // FTZ is irrelevant, we only use bitwise booleans and CMPPS
    const __m256 exponent = _mm256_set1_ps(1.0f);   // set1_epi32(0x3f800000)
    const __m256 bit_select = _mm256_castsi256_ps(
          _mm256_set_epi32(  // exponent + low significand bits
                0x3f800000 + (1<<7), 0x3f800000 + (1<<6),
                0x3f800000 + (1<<5), 0x3f800000 + (1<<4),
                0x3f800000 + (1<<3), 0x3f800000 + (1<<2),
                0x3f800000 + (1<<1), 0x3f800000 + (1<<0)
          ));

    // bitmap |= 0x3f800000;  // more efficient to do this scalar, but only if the data was in a register to start with
    __m256  bcast = _mm256_castsi256_ps(_mm256_set1_epi32(bitmap));
    __m256  ored  = _mm256_or_ps(bcast, exponent);
    __m256  isolated = _mm256_and_ps(ored, bit_select);
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);
}

如果有创造力,我们可以使用AVX1 FP指令执行相同的操作. AVX1具有dword广播(vbroadcastss ymm0, mem)和布尔值(vandps).这将产生有效的单精度浮点数的位模式,因此我们可以使用vcmpeqps,但是如果我们将位图位保留在元素的底部,它们都是非正规的.在Sandybridge上,这实际上可能很好:比较异常值可能不会受到任何惩罚.但是,如果您的代码曾经与DAZ(反常态为零)一起运行,它将中断,因此我们应避免这种情况.

我们可以vpor在屏蔽之前或之后设置一些指数,或者我们可以将位图上移到IEEE浮点格式的8位指数字段中.如果您的位图从整数寄存器开始,则将其移位会很好,因为在movd之前的shl eax, 23很便宜. 但是,如果它在内存中启动,则意味着放弃使用便宜的vbroadcastss负载.或者,您可以广播加载到xmm vpslld xmm0, xmm0, 23/vinsertf128 ymm0, xmm0, 1.但这仍然比vbroadcastss/vorps/vandps/vcmpeqps

更糟

(在存储/重新加载之前进行标量OR可以解决相同的问题.)

所以:

# untested
# pointer to bitmap in rdi
inverse_movemask:
    vbroadcastss  ymm0, [rdi]

    vorps         ymm0, ymm0, [set_exponent]   ; or hoist this constant out with a broadcast-load

    vmovaps       ymm7, [bit_select]          ; hoist this out of any loop, too
    vandps        ymm0, ymm0, ymm7
    ; ymm0 exponent = 2^0, mantissa = 0 or 1<<i where i = element number
    vcmpeqps      ymm0, ymm0, ymm7
    ret

section .rodata
ALIGN 32
      ; low bit -> low element.  _mm_setr order
    bit_select: dd 0x3f800000 + (1<<0), 0x3f800000 + (1<<1)
                dd 0x3f800000 + (1<<2), 0x3f800000 + (1<<3)
                dd 0x3f800000 + (1<<4), 0x3f800000 + (1<<5)
                dd 0x3f800000 + (1<<6), 0x3f800000 + (1<<7)

    set_exponent: times 8 dd 0x3f800000    ; 1.0f
    ;  broadcast-load this instead of duplicating it in memory if you're hoisting it.

代替广播加载set_exponent,您可以改组bit_select:只要设置了0x3f800000位,那么元素0是否也设置了位3或其他什么都无关紧要,只是不设置位0 .因此vpermilpsvshufps进行复制和改编都可以.

或者如果位图以整数寄存器开头,则您可以使用标量OR并避免该矢量常量. (标量OR可以在更多端口上运行.)

# alternate top of the function for input in an integer reg, not pointer.
    or     edi, 0x3f800000
    mov    [rsp-4], edi             ; red-zone
    vbroadcastss ymm0, [rsp-4]
    ;; skip the vorps

存储/重载可能具有与vmovd(1c),vpshufd xmm(1c),vinsertf128(3c)= 5c总计类似的延迟,这些延迟是从Intel SnB系列上没有AVX2或AVX512的整数寄存器进行广播的.而且它的融合域uops更少(2而不是3),并且没有命中shuffle端口(SnB系列上的p5为3 uops).您的选择可能取决于周围的代码中是否存在加载/存储压力或端口5压力.

(SnB/IvB在2个端口上具有整数洗牌单元,仅FP洗牌限制为1.Haswell删除p5之外的洗牌单元.但是除非您进行动态调度以避免在AVX2 CPU上使用此功能,否则您可能想调整以适应新的CPU,同时仍与仅AVX1的CPU保持兼容性.)

如果要进行随机播放的ALU广播(如clang一样),则可以借用clang的技巧来完成vorps xmm,以便在拆分256位操作的AMD CPU上保存uop,并允许更窄的范围或常数.但这是没有意义的:要么将值存储在整数寄存器中(可以使用标量or),要么在内存中应使用vbroadcastss ymm.我想如果在Zen2之前调整AMD,则可以考虑广播XMM负载,VPOR XMM,然后是vinsertf128.


https://www.h-schmidt.net/FloatConverter/IEEE754.html 是有用的IEEE754 FP值十六进制位模式转换器,如果您想检查某些FP位模式代表什么值.

在所有Intel CPU上,

vcmpeqps的延迟和吞吐量与vaddps相同. (这不是不是巧合;它们在同一执行单元上运行).这意味着SnB-Broadwell上有3个周期的延迟,而Skylake上有4个周期的延迟.但是vpcmpeqd只有1c的延迟.

因此,此方法具有良好的吞吐量(仅比AVX2整数多1 uop,而无需vorps),但延迟却延长了3个周期,或者在Skylake上增加了4个周期.


但是比较浮点数不是危险还是不好的做法?

如果比较输入之一是计算的舍入结果(例如,vaddpsvmulps的输出),则

进行完全相等的比较可能会产生意外结果.布鲁斯·道森(Bruce Dawson)关于FP数学(尤其是x86)的博客系列非常出色,特别是比较浮点数,2012年版.但是在这种情况下,我们正在控制FP位模式,并且不进行舍入.

具有相同位模式的非NaN FP值将始终比较相等.

具有不同位模式的

FP值将始终比较为不相等,除了-0.0+0.0(仅符号位不同)和DAZ模式下的非规格化值.后者就是为什么我们使用vpor的原因;如果您知道DAZ已禁用并且FP硬件不需要协助来比较异常值,则可以跳过此步骤. (IIRC,Sandybridge不需要,甚至可以在没有辅助的情况下添加/降低非正规性.当Intel硬件上需要微码辅助时,通常是从常规输入产生非正规结果,但是比较不会产生FP结果.)

What I'm trying to achieve is based on each bit in a byte, set to all ones in each dword in a ymm register (or memory location)

e.g.

al = 0110 0001

ymm0 = 0x00000000 FFFFFFFF FFFFFFFF 00000000 00000000 00000000 00000000 FFFFFFFF

i.e. an inverse of vmovmskps eax, ymm0 / _mm256_movemask_ps, turning a bitmap into a vector mask.

I'm thinking there are a handful of sse/avx instructions that can do this relatively simply but I haven't been able to work it out. Preferably sandy bridge compatible so no avx2.

解决方案

If AVX2 is available, see is there an inverse instruction to the movemask instruction in intel avx2? instead for more efficient versions using integer SIMD. You could use that idea and split your bitmap into two 4-bit chunks for use with a LUT. That might perform fairly well: vinsertf128 has 1 per clock throughput on Sandybridge, and one per 0.5c on Haswell/Skylake.

A SIMD-integer solution with AVX1 could just do the same work twice for high/low vector halves (2x broadcast the bitmap, 2x mask it, 2x vpcmpeqd xmm), then vinsertf128, but that kinda sucks.

You might consider making an AVX2 version separate from your AVX1-only version, using vpbroadcastd ymm0, mem / vpand ymm0, mask / vpcmpeqd dst, ymm0, mask, because that's very efficient, especially if you're loading the bitmap from memory and you can read a whole dword for the bitmap. (Broadcast-loads of dword or qword don't need an ALU shuffle so it's worth overreading). The mask is set_epi32(1<<7, 1<<6, 1<<5< ..., 1<<0), which you can load with vpmovzxbd ymm, qword [constant] so it only takes 8 bytes of data memory for 8 elements.


Intrinsics version, see below for explanation and asm version. Compiles about how we expect on Godbolt with gcc/clang -march=sandybridge

#include <immintrin.h>
// AVX2 can be significantly more efficient, doing this with integer SIMD
// Especially for the case where the bitmap is in an integer register, not memory
// It's fine if `bitmap` contains high garbage; make sure your C compiler broadcasts from a dword in memory if possible instead of integer load with zero extension.
// e.g. __m256 _mm256_broadcast_ss(float *a);  or memcpy to unsigned.
// Store/reload is not a bad strategy vs. movd + 2 shuffles so maybe just do it even if the value might be in a register; it will force some compilers to store/broadcast-load.  But it might not be type-punning safe  even though it's an intrinsic.

// Low bit -> element 0, etc.
__m256 inverse_movemask_ps_avx1(unsigned bitmap)
{
    // if you know DAZ is off: don't OR, just AND/CMPEQ with subnormal bit patterns
    // FTZ is irrelevant, we only use bitwise booleans and CMPPS
    const __m256 exponent = _mm256_set1_ps(1.0f);   // set1_epi32(0x3f800000)
    const __m256 bit_select = _mm256_castsi256_ps(
          _mm256_set_epi32(  // exponent + low significand bits
                0x3f800000 + (1<<7), 0x3f800000 + (1<<6),
                0x3f800000 + (1<<5), 0x3f800000 + (1<<4),
                0x3f800000 + (1<<3), 0x3f800000 + (1<<2),
                0x3f800000 + (1<<1), 0x3f800000 + (1<<0)
          ));

    // bitmap |= 0x3f800000;  // more efficient to do this scalar, but only if the data was in a register to start with
    __m256  bcast = _mm256_castsi256_ps(_mm256_set1_epi32(bitmap));
    __m256  ored  = _mm256_or_ps(bcast, exponent);
    __m256  isolated = _mm256_and_ps(ored, bit_select);
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);
}

If we get creative, we can use AVX1 FP instructions to do the same thing. AVX1 has dword broadcast (vbroadcastss ymm0, mem), and booleans (vandps). That will produce bit patterns that are valid single-precision floats so we could use vcmpeqps, but they're all denormals if we leave the bitmap bits in the bottom of the element. That might actually be fine on Sandybridge: there might be no penalty for comparing denormals. But it will break if your code ever runs with DAZ (denormals-are-zero), so we should avoid this.

We could vpor with something to set an exponent before or after masking, or we could shift the bitmap up into the 8-bit exponent field of the IEEE floating-point format. If your bitmap starts in an integer register, shifting it would be good, because shl eax, 23 before movd is cheap. But if it starts in memory, that means giving up on using a cheap vbroadcastss load. Or you could broadcast-load to xmm, vpslld xmm0, xmm0, 23 / vinsertf128 ymm0, xmm0, 1. But that's still worse than vbroadcastss / vorps / vandps / vcmpeqps

(Scalar OR before store/reload solves the same problem.)

So:

# untested
# pointer to bitmap in rdi
inverse_movemask:
    vbroadcastss  ymm0, [rdi]

    vorps         ymm0, ymm0, [set_exponent]   ; or hoist this constant out with a broadcast-load

    vmovaps       ymm7, [bit_select]          ; hoist this out of any loop, too
    vandps        ymm0, ymm0, ymm7
    ; ymm0 exponent = 2^0, mantissa = 0 or 1<<i where i = element number
    vcmpeqps      ymm0, ymm0, ymm7
    ret

section .rodata
ALIGN 32
      ; low bit -> low element.  _mm_setr order
    bit_select: dd 0x3f800000 + (1<<0), 0x3f800000 + (1<<1)
                dd 0x3f800000 + (1<<2), 0x3f800000 + (1<<3)
                dd 0x3f800000 + (1<<4), 0x3f800000 + (1<<5)
                dd 0x3f800000 + (1<<6), 0x3f800000 + (1<<7)

    set_exponent: times 8 dd 0x3f800000    ; 1.0f
    ;  broadcast-load this instead of duplicating it in memory if you're hoisting it.

Instead of broadcast-loading set_exponent, you could instead shuffle bit_select: as long as the 0x3f800000 bits are set, it doesn't matter if element 0 also sets bit 3 or something, just not bit 0. So vpermilps or vshufps to copy-and-shuffle would work.

Or if the bitmap is in an integer register to start with, you can use scalar OR and avoid that vector constant. (And scalar OR runs on more ports.)

# alternate top of the function for input in an integer reg, not pointer.
    or     edi, 0x3f800000
    mov    [rsp-4], edi             ; red-zone
    vbroadcastss ymm0, [rsp-4]
    ;; skip the vorps

Store/reload might have similar latency to vmovd (1c), vpshufd xmm (1c), vinsertf128 (3c) = 5c total to broadcast from an integer register without AVX2 or AVX512 on Intel SnB-family. And it's fewer fused-domain uops (2 instead of 3), and doesn't hit the shuffle port (3 uops for p5 on SnB-family). Your choice might depend on whether there's there's load/store pressure or port-5 pressure in the surrounding code.

(SnB/IvB have integer-shuffle units on 2 ports, only FP shuffles are limited to 1. Haswell remove the shuffle units outside of p5.But unless you do dynamic dispatching to avoid using this on AVX2 CPUs, you might want to tune for newer CPUs while still maintaining compat with AVX1-only CPUs.)

If you were going to do an ALU broadcast with shuffles (like clang does), you could borrow clang's trick of doing a vorps xmm to save a uop on AMD CPUs that split 256-bit ops, and to allow a narrower OR constant. But that's pointless: either you had the value in an integer register (where you can use scalar or), or it was in memory where you should have used vbroadcastss ymm. I guess if tuning for AMD before Zen2 you might consider an broadcast XMM load, VPOR XMM, then vinsertf128.


https://www.h-schmidt.net/FloatConverter/IEEE754.html is a useful IEEE754 FP value <-> hex bit pattern converter, in case you want to check what value some FP bit pattern represents.

vcmpeqps has the same latency and throughput as vaddps on all Intel CPUs. (This is not a coincidence; they run on the same execution unit). That means 3 cycle latency on SnB-Broadwell, and 4 cycle latency on Skylake. But vpcmpeqd is only 1c latency.

So this method has good throughput (only 1 uop more than AVX2 integer, where vorps isn't needed), but worse latency by 3 cycles, or 4 on Skylake.


But isn't comparing floating point numbers dangerous or bad practice?

Comparison for exact equality can give unexpected results when one of the comparison inputs is the rounded result of a calculation (e.g. the output of vaddps or vmulps). Bruce Dawson's blog series on FP math in general and x86 in particular is excellent, specifically Comparing Floating Point Numbers, 2012 Edition. But in this case, we're controlling the FP bit-patterns, and there's no rounding.

Non-NaN FP values with the same bit-pattern will always compare equal.

FP values with different bit-patterns will always compare as not-equal, except for -0.0 and +0.0 (which differ in sign bit only), and denormalized values in DAZ mode. The latter is why we're using vpor; you can skip it if you know DAZ is disabled and your FP hardware doesn't require an assist for comparison of denormals. (IIRC, Sandybridge doesn't, and can even add / sub denormals without an assist. When microcode assists are needed on Intel hardware, it's usually when producing a denormal result from normal inputs, but compares don't produce an FP result.)

这篇关于在没有AVX2的情况下,如何使用字节中的位设置ymm寄存器中的双字? (vmovmskps的倒数)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-29 15:03