如何将标量合并为向量而不编译器浪费指令将上元素归零?英特尔内在函数的设计限制?

本文介绍了如何将标量合并为向量而不编译器浪费指令将上元素归零?英特尔内在函数的设计限制?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我没有想到特定的用例；我在问这是否真的是英特尔内在函数的设计缺陷/限制，或者我是否只是遗漏了一些东西.

I don't have a particular use-case in mind; I'm asking if this is really a design flaw / limitation in Intel's intrinsics or if I'm just missing something.

如果您想将标量浮点数与现有向量相结合，似乎没有办法在没有高元素归零或使用 Intel 内在函数将标量广播到向量的情况下做到这一点.我还没有研究过 GNU C 原生向量扩展和相关的内置函数.

If you want to combine a scalar float with an existing vector, there doesn't seem to be a way to do it without high-element-zeroing or broadcasting the scalar into a vector, using Intel intrinsics. I haven't investigated GNU C native vector extensions and the associated builtins.

如果额外的内在优化消失，这不会太糟糕，但它不会与 gcc(5.4 或 6.2).也没有很好的方法来使用 pmovzx 或 insertps 作为负载，因为相关的原因是它们的内在函数只采用向量参数.(并且 gcc 不会将标量 -> 向量加载折叠到 asm 指令中.)

This wouldn't be too bad if the extra intrinsic optimized away, but it doesn't with gcc (5.4 or 6.2). There's also no nice way to use pmovzx or insertps as loads, for the related reason that their intrinsics only take vector args. (And gcc doesn't fold a scalar->vector load into the asm instruction.)

__m128 replace_lower_two_elements(__m128 v, float x) {
  __m128 xv = _mm_set_ss(x);        // WANTED: something else for this step, some compilers actually compile this to a separate insn
  return _mm_shuffle_ps(v, xv, 0);  // lower 2 elements are both x, and the garbage is gone
}

gcc 5.3 -march=nehalem -O3 输出，以启用 SSE4.1 并针对该 Intel CPU 进行调整:(如果没有 SSE4.1，情况会更糟；多个指令将上层元素归零).

gcc 5.3 -march=nehalem -O3 output, to enable SSE4.1 and tune for that Intel CPU: (It's even worse without SSE4.1; multiple instructions to zero the upper elements).

    insertps  xmm1, xmm1, 0xe    # pointless zeroing of upper elements.  shufps only reads the low element of xmm1
    shufps    xmm0, xmm1, 0      # The function *should* just compile to this.
    ret

TL:DR:这个问题的其余部分只是问你是否真的可以有效地做到这一点，如果不能，为什么不能.

clang 的 shuffle-optimizer 做到了这一点，并且不会浪费指令将高元素归零(_mm_set_ss(x))，或将标量复制到它们中(_mm_set1_ps(x)).与其编写编译器必须优化的东西，不如说有一种方法可以有效地"编写它吗?首先在 C 中?即使是最近的 gcc 也没有优化它，所以这是一个真正的(但次要的)问题.

clang's shuffle-optimizer gets this right, and doesn't waste instructions on zeroing high elements (_mm_set_ss(x)), or duplicating the scalar into them (_mm_set1_ps(x)). Instead of writing something the compiler has to optimize away, shouldn't there be a way to write it "efficiently" in C in the first place? Even very recent gcc doesn't optimize it away, so this is a real (but minor) problem.

如果有一个标量->128b 等价于 __m256 _mm256_castps128_ps256 (__m128 a).即产生一个 __m128 在上元素中带有未定义的垃圾，在低元素中生成浮点数，如果标量浮点数/双精度数已经在 xmm 寄存器中，则编译为零 asm 指令.

This would be possible if there was a scalar->128b equivalent of __m256 _mm256_castps128_ps256 (__m128 a). i.e. produce a __m128 with undefined garbage in upper elements, and the float in the low element, compiling to zero asm instructions if the scalar float/double was already in an xmm register.

以下内在函数不存在，但它们应该.

一个标量 -> __m128 相当于 _mm256_castps128_ps256 如上所述.标量已注册情况的最通用解决方案.

a scalar->__m128 equivalent of _mm256_castps128_ps256 as described above. The most general solution for the scalar-already-in-register case.

__m128 _mm_move_ss_scalar (__m128 a, float s):用标量s替换向量a的低元素.如果有通用标量-> __m128(上一个要点)，这实际上不是必需的.(movss 的 reg-reg 形式合并，不像归零的加载形式，与 movd 不同在这两种情况下都将上层元素归零.要复制一个没有错误依赖关系的标量浮点寄存器，请使用 movaps).

__m128 _mm_move_ss_scalar (__m128 a, float s): replace low element of vector a with scalar s. This isn't actually necessary if there's a general-purpose scalar->__m128 (previous bullet point). (The reg-reg form of movss merges, unlike the load form which zeros, and unlike movd which zeros upper elements in both cases. To copy a register holding a scalar float without false dependencies, use movaps).

__m128i _mm_loadzxbd (const uint8_t *four_bytes) 和其他大小的 PMOVZX/PMOVSX:AFAICT，没有好的安全方法可以使用 PMOVZX 内在函数作为负载，因为不方便的安全方法不会使用 gcc 进行优化.

__m128i _mm_loadzxbd (const uint8_t *four_bytes) and other sizes of PMOVZX / PMOVSX: AFAICT, there's no good safe way to use the PMOVZX intrinsics as a load, because the inconvenient safe way doesn't optimize away with gcc.

__m128 _mm_insertload_ps (__m128 a, float *s, const int imm8).INSERTPS 的行为与负载不同:imm8 的高 2 位被忽略，并且它始终采用有效地址处的标量(而不是内存中向量的元素).这使它可以处理非 16B 对齐的地址，并且如果 float 正好位于未映射页面之前，即使没有错误也可以正常工作.

__m128 _mm_insertload_ps (__m128 a, float *s, const int imm8). INSERTPS behaves differently as a load: the upper 2 bits of the imm8 are ignored, and it always takes the scalar at the effective address (instead of an element from a vector in memory). This lets it work with addresses that aren't 16B-aligned, and work even without faulting if the float right before an unmapped page.

与 PMOVZX 一样，gcc 无法将上元素归零的 _mm_load_ss() 折叠到 INSERTPS 的内存操作数中.(注意如果 imm8 的高 2 位不是都为零，则 _mm_insert_ps(xmm0, _mm_load_ss(), imm8) 可以编译为 insertps xmm0,xmm0,foo>，使用不同的 imm8 将 vec 中的元素归零，就像 src 元素实际上是 MOVSS 从内存中生成的零一样.在这种情况下，Clang 实际上使用 XORPS/BLENDPS)

Like with PMOVZX, gcc fails to fold an upper-element-zeroing _mm_load_ss() into a memory operand for INSERTPS. (Note that if the upper 2 bits of the imm8 aren't both zero, then _mm_insert_ps(xmm0, _mm_load_ss(), imm8) can compile to insertps xmm0,xmm0,foo, with a different imm8 that zeros elements in vec as-if the src element was actually a zero produced by MOVSS from memory. Clang actually uses XORPS/BLENDPS in that case)

是否有任何可行的解决方法来模拟那些既安全(不要通过加载可能触及下一页和段错误的 16B 在 -O0 处中断)和高效(没有至少在当前的 gcc 和 clang 上浪费了 -O3 的指令，最好还有其他主要编译器)?最好也是以一种可读的方式，但如果有必要，它可以放在一个内联包装函数后面，比如 __m128 float_to_vec(float a){ something(a);}.

Are there any viable workarounds to emulate any of those that are both safe (don't break at -O0 by e.g. loading 16B that might touch the next page and segfault), and efficient (no wasted instructions at -O3 with current gcc and clang at least, preferably also other major compilers)? Preferably also in a readable way, but if necessary it could be put behind an inline wrapper function like __m128 float_to_vec(float a){ something(a); }.

英特尔是否有充分的理由不引入这样的内在函数?他们可以在添加 _mm256_castps128_ps256 的同时添加一个带有未定义上元素的 float->__m128.这是编译器内部结构使其难以实现的问题吗?也许特别是 ICC 内部结构?

Is there any good reason for Intel not to introduce intrinsics like that? They could have added a float->__m128 with undefined upper elements at the same time as adding _mm256_castps128_ps256. Is this a matter of compiler internals making it hard to implement? Perhaps specifically ICC internals?

x86-64 上的主要调用约定(SysV 或 MS __vectorcall)采用 xmm0 中的第一个 FP args 并返回 xmm0 中的标量 FP args，上层元素未定义.(有关 ABI 文档，请参阅 x86 标签维基).这意味着编译器在具有未知上层元素的寄存器中具有标量浮点数/双精度数的情况并不少见.这在矢量化内循环中很少见，所以我认为避免这些无用的指令只会节省一点代码大小.

The major calling conventions on x86-64 (SysV or MS __vectorcall) take the first FP arg in xmm0 and return scalar FP args in xmm0, with upper elements undefined. (See the x86 tag wiki for ABI docs). This means it's not uncommon for the compiler to have a scalar float/double in a register with unknown upper elements. This will be rare in a vectorized inner loop, so I think avoiding these useless instructions will mostly just save a bit of code size.

pmovzx 情况更严重:这可能是您在内部循环中使用的东西(例如，对于 VPERMD shuffle 掩码的 LUT，与将每个索引填充到内存中的 32 位相比，缓存占用空间节省了 4 倍).

The pmovzx case is more serious: that is something you might use in an inner loop (e.g. for a LUT of VPERMD shuffle masks, saving a factor of 4 in cache footprint vs. storing each index padded to 32 bits in memory).

pmovzx-as-a-load 问题已经困扰我一段时间了，这个问题的原始版本让我思考在 xmm 寄存器中使用标量浮点的相关问题.pmovzx 作为负载的用例可能比标量 -> __m128 的用例更多.

The pmovzx-as-a-load issue has been bothering me for a while now, and the original version of this question got me thinking about the related issue of using a scalar float in an xmm register. There are probably more use-cases for pmovzx as a load than for scalar->__m128.

推荐答案

它在 GNU C 内联 asm 中是可行的，但是这很丑陋并且无法进行许多优化，包括常量传播 (https://gcc.gnu.org/wiki/DontUseInlineAsm).这不是公认的答案.我将此添加为答案而不是问题的一部分，因此问题并不大.

It's doable with GNU C inline asm, but this is ugly and defeats many optimizations, including constant-propagation (https://gcc.gnu.org/wiki/DontUseInlineAsm). This will not be the accepted answer. I'm adding this as an answer instead of part of the question so the question isn't huge.

// don't use this: defeating optimizations is probably worse than an extra instruction
#ifdef __GNUC__
__m128 float_to_vec_inlineasm(float x) {
  __m128 retval;
  asm ("" : "=x"(retval) : "0"(x));   // matching constraint: provide x in the same xmm reg as retval
  return retval;
}
#endif

这确实会根据需要编译为单个 ret，并将内联让您 shufps 将标量转换为向量:

This does compile to a single ret, as desired, and will inline to let you shufps a scalar into a vector:

gcc5.3
float_to_vec_and_shuffle_asm(float __vector(4), float):
    shufps  xmm0, xmm1, 0       # tmp93, xv,
    ret

在 Godbolt 编译器浏览器.

这在纯汇编语言中显然是微不足道的，在这种情况下，您不必与编译器斗争以使其不发出您不想要或不需要的指令.

This is obviously trivial in pure assembly language, where you don't have to fight with a compiler to get it not to emit instructions you don't want or need.

我还没有找到任何真正的方法来编写 __m128 float_to_vec(float a){ something(a);} 编译成一个 ret 指令.使用 _mm_undefined_pd() 和 _mm_move_sd() 尝试 double 实际上会使 gcc 的代码变得更糟(请参阅上面的 Godbolt 链接).现有的 float->__m128 内在函数没有任何帮助.

I haven't found any real way to write a __m128 float_to_vec(float a){ something(a); } that compiles to just a ret instruction. An attempt for double using _mm_undefined_pd() and _mm_move_sd() actually makes worse code with gcc (see the Godbolt link above). None of the existing float->__m128 intrinsics help.

题外话:实际 _mm_set_ss() 代码生成策略:当您编写必须将上层元素归零的代码时，编译器会从一系列有趣的策略中进行挑选.有些不错，有些奇怪.如您在上面的 Godbolt 链接中所见，同一编译器(gcc 或 clang)上的 double 和 float 之间的策略也有所不同.

Off-topic: actual _mm_set_ss() code-gen strategies: When you do write code that has to zero upper elements, compilers pick from an interesting range of strategies. Some good, some weird. The strategies also differ between double and float on the same compiler (gcc or clang), as you can see on the Godbolt link above.

一个例子:__m128 float_to_vec(float x){ return _mm_set_ss(x);} 编译为:

    # gcc5.3 -march=core2
    movd    eax, xmm0      # movd xmm0,xmm0 would work; IDK why gcc doesn't do that
    movd    xmm0, eax
    ret

    # gcc5.3 -march=nehalem
    insertps        xmm0, xmm0, 0xe
    ret

    # clang3.8 -march=nehalem
    xorps   xmm1, xmm1
    blendps xmm0, xmm1, 14          # xmm0 = xmm0[0],xmm1[1,2,3]
    ret

这篇关于如何将标量合并为向量而不编译器浪费指令将上元素归零?英特尔内在函数的设计限制?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！