GNU C native vectors: how to broadcast a scalar, like x86's _mm_set1_epi16

Problem description

    How do I write a portable GNU C builtin vectors version of this, which doesn't depend on the x86 set1 intrinsic?

    typedef uint16_t v8su __attribute__((vector_size(16)));
    
    v8su set1_u16_x86(uint16_t scalar) {
        return (v8su)_mm_set1_epi16(scalar);   // cast needed for gcc
    }
    

    Surely there must be a better way than

    v8su set1_u16(uint16_t s) {
        return (v8su){s,s,s,s,  s,s,s,s};
    }
    

    I don't want to write an AVX2 version of that for broadcasting a single byte!

    Even a gcc-only or clang-only answer to this part would be interesting, for cases where you want to assign to a variable instead of only using as an operand to a binary operator (which works well with gcc, see below).


    If I want to use a broadcast-scalar as one operand of a binary operator, this works with gcc (as documented in the manual), but not with clang:

    v8su vecdiv10(v8su v) { return v / 10; }   // doesn't compile with clang
    

    With clang, if I'm targeting only x86 and just using native vector syntax to get the compiler to generate modular multiplicative inverse constants and instructions for me, I can write:

    v8su vecdiv_set1(v8su v) {
        return v / (v8su)_mm_set1_epi16(10);   // gcc needs the cast
    }
    

    But then I have to change the intrinsic if I widen the vector (to _mm256_set1_epi16), instead of converting the whole code to AVX2 by changing to vector_size(32) in one place (for pure-vertical SIMD that doesn't need shuffling). It also defeats part of the purpose of native vectors, since that won't compile for ARM or any non-x86 target.

    The ugly cast is required because gcc, unlike clang, doesn't consider v8su {aka __vector(8) short unsigned int} compatible with __m128i {aka __vector(2) long long int}.

    BTW, all of this compiles to good asm with gcc and clang (see it on Godbolt). This is just a question of how to write elegantly, with readable syntax that doesn't repeat the scalar N times. e.g. v / 10 is compact enough that there's no need to even put it in its own function.

    Compiling efficiently with ICC is a bonus, but not required. GNU C native vectors are clearly an afterthought for ICC, and even simple stuff like this doesn't compile efficiently. set1_u16 compiles to 8 scalar stores and a vector load, instead of MOVD / VPBROADCASTW (with -xHOST enabled, because it doesn't recognize -march=haswell, but Godbolt runs on a server with AVX2 support). Purely casting the results of _mm_ intrinsics is ok, but the division calls an SVML function!

    Solution

    A generic broadcast solution can be found for GCC and Clang using two observations:

    1. Clang's OpenCL vector extensions and GCC's vector extensions support scalar - vector operations.
    2. x - 0 = x (but x + 0 does not work due to signed zero).
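
The signed-zero point in observation 2 can be checked on plain scalars. The following is a minimal sketch, assuming default IEEE 754 semantics (round-to-nearest, no -ffast-math); the helper names add_zero, sub_zero and has_sign_bit are hypothetical and not part of the original answer.

```c
#include <math.h>   /* signbit */

/* x + (+0.0f): for x == -0.0f the IEEE sum is +0.0f, so the sign is lost. */
static float add_zero(float x) { return x + 0.0f; }

/* x - (+0.0f): for x == -0.0f the result stays -0.0f, so x is preserved. */
static float sub_zero(float x) { return x - 0.0f; }

/* Returns 1 if the sign bit of x is set, 0 otherwise. */
static int has_sign_bit(float x) { return signbit(x) != 0; }
```

This is why the broadcast subtracts a zero vector rather than adding one: only the subtraction leaves every possible input bit pattern of a zero unchanged.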

    Here is a solution for a vector of four floats.

    #if defined (__clang__)
    typedef float v4sf __attribute__((ext_vector_type(4)));
    #else
    typedef float v4sf __attribute__ ((vector_size (16)));
    #endif
    
    v4sf broadcast4f(float x) {
      return x - (v4sf){};
    }
    

    https://godbolt.org/g/PXr3Xb

    The same generic solution can be used for different vectors. Here is an example for a vector of eight unsigned shorts.

    #if defined (__clang__)
    typedef unsigned short v8su __attribute__((ext_vector_type(8)));
    #else
    typedef unsigned short v8su __attribute__((vector_size(16)));
    #endif
    
    v8su broadcast8us(short x) {
      return x - (v8su){};
    }
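
The question also asked for code whose vector width can be changed in one place, and for a broadcast that can be assigned to a variable. A sketch of how the same trick might be arranged for that is shown here; the LANES macro and the names vu16, broadcast_u16 and div10_u16 are assumptions made for illustration, not something from the original answer.

```c
#include <stdint.h>

/* Hypothetical single knob: 8 lanes gives a 128-bit vector, 16 gives 256-bit. */
#define LANES 8

#if defined(__clang__)
typedef uint16_t vu16 __attribute__((ext_vector_type(LANES)));
#else
typedef uint16_t vu16 __attribute__((vector_size(LANES * sizeof(uint16_t))));
#endif

/* Broadcast by subtracting a zero vector from the scalar, as in the answer. */
static inline vu16 broadcast_u16(uint16_t x) {
    return x - (vu16){0};
}

/* Divide every lane by 10; the broadcast value is held in a named variable. */
static inline vu16 div10_u16(vu16 v) {
    const vu16 ten = broadcast_u16(10);
    return v / ten;
}
```

No x86 intrinsic appears here, so nothing in the source is tied to one target. Whether the generated code is as good as the intrinsic version on a given compiler is worth checking in the asm output.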
    

    ICC (17) supports a subset of the GCC vector extensions, but it does not yet support either vector + scalar or vector * scalar, so intrinsics are still necessary for broadcasts. MSVC does not support any vector extensions.


  • 08-29 07:35