本文介绍了显示向量寄存器的约定的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否有显示/写入大型寄存器的约定,例如Intel AVX指令集中的可用寄存器?

例如,如果您在最低有效字节中有1个,在最高有效字节中有20个,而在xmm寄存器中的其他地方为0,则首选按字节显示(小尾数):

[1, 0, 0, 0, ..., 0, 20]

或者这是首选:

[20, 0, 0, 0, ..., 0, 1]

类似地,当显示由较大数据项组成的寄存器时,是否应用相同的规则?例如,为了将寄存器显示为DWORD,我假设每个DWORD仍然以通常的方式(大端顺序)编写,但是DWORDS的顺序是什么?

[0x1, 0x0, ..., 0x14]

vs

[0x14, 0x0, ..., 0x1]

讨论

我认为两个最有希望的答案就是"LSE first"(即上述示例中的第一个输出)或"MSE first"(第二个输出).两者都不依赖于平台的字节序,因为确实一次在寄存器中的数据通常是字节序无关的(就像GP寄存器或longint或C语言中的任何操作都独立于字节序).字节顺序出现在寄存器<->内存接口中,在这里我要询问的是寄存器中已经存在的数据.

可能还会存在其他答案,例如取决于字节序的输出(而Paul R的答案可能是一个,但我无法确定).

LSE优先

LSE-first的一个优点似乎尤其是按字节输出:通常字节从0到N进行编号,而LSB为零,因此LSB-first输出将其输出随着索引的增加,就像您将输出大小为N的字节数组一样.

在小端字节体系结构上也很不错,因为输出然后匹配存储在内存中的相同向量的内存中表示形式.

MSE优先

这里的主要优点似乎是较小元素的输出与较大尺寸的输出顺序相同(仅在不同分组的情况下).例如,对于MSB表示法为[0x4, 0x3, 0x2, 0x1]的4字节向量,字节元素,word和dword元素的输出为:

[0x4、0x3、0x2、0x1][0x0403,0x0201][0x04030201]

从本质上讲,即使从字节输出中,您也可以读取"字或dword输出,反之亦然,因为字节已经以通常的MSB优先顺序显示数字.另一方面,LSE-first的相应输出为:

[0x1、0x2、0x3、0x4][0x0201,0x0403][0x04030201]

请注意,每个图层都相对于其上方的行进行交换,因此读取较大或较小的值要困难得多.您需要更多地依赖于输出最适合您的问题的元素.

这种格式的另一个优点是,在BE体系结构上,输出将与存储在内存中的同一矢量的内存中表示形式匹配.

英特尔首先在其手册中使用MSE.


最低有效的元素

这样的编号不仅用于文档目的,而且在结构上是可见的,例如在随机掩码中.

当然,与LE平台上LSE-first的相应优势相比,此优势微不足道,因为BE在商用SIMD硬件中几乎已经消失了.

解决方案

保持一致是最重要的事情;如果我正在处理已经具有LSE优先注释或变量名的现有代码,则将其匹配.

给出选择,我更喜欢在评论中使用MSE优先表示法,尤其是在设计带有混洗功能的产品或特别是包装/拆包成不同元素尺寸的产品时.

英特尔不仅在其手册中的图表中使用MSE-first,而且在诸如pslldq(字节移位)和psrlw(位移位)之类的内在函数/指令的命名中使用:左位/字节转向MSB . LSE优先思考并不能使您摆脱思维上的逆转,这意味着您在考虑班次而不是装载/存储时必须这样做.由于x86是低位优先的,因此无论如何您有时都必须考虑这一点.


在MSE首先考虑向量的过程中,请记住,存储顺序是从右到左.当您需要考虑从一块内存中重叠未对齐的负载时,您可以按从右到左的顺序绘制内存内容,因此您可以查看它的向量长度窗口. >

在文本编辑器中,在某项的左侧添加新文本并将现有文本移到右侧是没有问题的,因此在注释中添加更多元素不是问题.

MSE优先表示法的两个主要缺点是:

  • 难于向后键入字母(例如对于32位元素的AVX矢量,为h g f e | d c b a),所以有时我只是从右边开始输入a,左箭头,b,空格,向左ctrl键,c,空格...等等.

  • 与C数组初始化程序顺序相反.通常不是问题,因为_mm_set_epi*使用MSE优先顺序. (使用_mm_setr_epi*匹配LSE优先注释).


当尝试设计256b vpalignr的车道交叉版本时,MSE优先是一个很好的例子:请参阅我对这个问题的回答如何使用AVX2有效地连接两个向量?.其中包括MSE优先表示法中的设计说明.

作为另一个示例,请考虑在整个向量上实现可变计数字节移位.您可以制作一个pshufb控制向量表,但这将浪费大量的缓存空间.从内存加载滑动窗口更好:

 /*  Example of using MSE notation for memory as well as vectors

// 4-element vectors to keep the design notes compact
// I started by just writing down a couple rows of this, then noticing which way they lined up
<< 3:                       00 FF FF FF
<< 1:                 02 01 00 FF
   0:              03 02 01 00
>> 2:        FF FF 03 02
>> 3:     FF FF FF 03
>> 4:  FF FF FF FF

       FF FF FF FF 03 02 01 00 FF FF FF FF
  highest address                       lowest address
*/

#include <immintrin.h>
#include <stdint.h>
// positive counts are right shifts, negative counts are left
// a left-only or right-only implementation would only have one side of the table,
// and only need 32B alignment for the constant in memory to prevent cache-line splits.
__m128i vshift(__m128i v, intptr_t bytes_right)
{   // intptr_t means the caller has to sign-extend it to the width of a pointer, saving a movsx in the non-inline version

   // C11 uses _Alignas, C++11 uses alignas
    _Alignas(64) static const int32_t shuffles[] = {
        -1, -1, -1, -1,
        0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
        -1, -1, -1, -1
    };  // compact but messy with a mix of ordering :/
    const char *identity_shuffle = 16 + (const char*)shuffles;  // points to the middle 16B

    //  count &= 0xf;  tricky to efficiently limit the count while still allowing >>16 to zero the vector, and to allow negative.
    __m128i control = _mm_load_si128((const __m128i*) (identity_shuffle + bytes_right));
    return _mm_shuffle_epi8(v, control);
}
 

对于MSE优先的情况,这是最坏的情况,因为右移会从最左边移出一个窗口.在LSE优先表示法中,它看起来更自然.尽管如此,除非我有倒退的:P,否则我认为它表明您可以成功使用MSE优先表示法,即使您希望遇到一些棘手的事情.它并没有觉得弯曲或过于复杂.我刚刚开始写下随机控制向量,然后将它们排成一行.如果使用uint8_t shuffles[] = { 0xff, 0xff, ..., 0, 1, 2, ..., 0xff };,在转换为C数组时,我可以使它稍微简单一些.我还没有测试过,只有它编译了t o一条指令:

    vpshufb xmm0, xmm0, xmmword ptr [rdi + vshift.shuffles+16]
    ret


MSE使您可以更轻松地注意到何时可以使用移位而不是随机播放指令来减少端口5上的压力. psllq xmm, 16/_mm_slli_epi64(v,16)将单词元素左移一位(在qword边界处置零).或者,当您需要移位字节元素时,唯一可用的移位是16位或更大.每元素最窄的移位是32位元素( vpsllvd )

当使用较大或较小粒度的混洗或混合时,例如,MSE可以很容易地正确获得混洗常数. pshufd当您可以将成对的单词元素保持在一起时,或者pshufb可以在整个矢量上随机排列单词(因为pshuflw/hw受限制).

_MM_SHUFFLE(d,c,b,a)也按MSE顺序排列.将其写为单个整数的任何其他方式也是如此,例如C ++ 14 0b11'10'01'000xE4(身份混洗).使用LSE优先表示法可使混洗常数相对于您的注释向后看". (除了pshufb常量,可以使用_mm_setr编写)

Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set?

For example, if you have 1 in the least significant byte, and 20 in the most significant byte, and 0 elsewhere in an xmm register, for a byte-wise display is the following preferred (little-endian):

[1, 0, 0, 0, ..., 0, 20]

or is this preferred:

[20, 0, 0, 0, ..., 0, 1]

Similarly, when displaying such registers as made up of larger data items, is the same rule applied? E.g., to display the register as DWORDs, I assume each DWORD is still written in the usual (big-endian) way, but what is the order of the DWORDS:

[0x1, 0x0, ..., 0x14]

vs

[0x14, 0x0, ..., 0x1]

Discussion

I think the two most promising answers are simply "LSE first" (i.e., the first output in the examples above) or "MSE first" (the second output). Neither depends on the endianness of the platform, as indeed once in a register data is generally endian independent (just like operations on a GP register or a long or int or whatever in C are independent of endianness). Endianness comes up in the register <-> memory interface, and here I'm asking about data already in a register.

It is possible that other answers exist, such as output that depends on endianness (and Paul R's answer may be one, but I can't tell).

LSE First

One advantage of LSE-first seems to be especially with byte-wise output: often the bytes are numbered from 0 to N, with the LSB being zero, so LSB-first output outputs it with increasing indexes, much like you'd output an array of bytes of size N.

It's also nice on little endian architectures since the output then matches the in-memory representation of the same vector stored to memory.

MSE First

The main advantage here seems to be that the output for smaller elements is in the same order as for larger sizes (only with different grouping). For example, for a 4-byte vector in MSB notation [0x4, 0x3, 0x2, 0x1], the output for byte elements, word and dword elements would be:

[0x4, 0x3, 0x2, 0x1][ 0x0403, 0x0201 ][ 0x04030201 ]

Essentially, even from the byte output you can just "read off" the word or dword output, or vice-versa, since the bytes are already in the usual MSB-first order for number display. On the other hand, the corresponding output for LSE-first is:

[0x1, 0x2, 0x3, 0x4][ 0x0201 , 0x0403 ][ 0x04030201 ]

Note that each layer undergoes swaps relative to the row above it, so it's much harder to read off larger or smaller values. You'd need to rely more on outputting the element that is the most natural for your problem.

This format also has the advantage that on BE architectures the output then matches the in-memory representation of the same vector stored to memory.

Intel uses MSE first in its manuals.


Least Significant Element

Such numberings are not just for documentation purposes - they are architecturally visible, e.g., in shuffle masks.

Of course this advantage is minuscule compared to the corresponding advantage of LSE-first on LE platforms since BE is almost dead in commodity SIMD hardware.

解决方案

Being consistent is the most important thing; If I'm working on existing code that already has LSE-first comments or variable names, I match that.

Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles or especially packing/unpacking to different element sizes.

Intel uses MSE-first not only in their diagrams in manuals, but in the naming of intrinsics/instructions like pslldq (byte shift) and psrlw (bit-shift): a left bit/byte shift goes towards the MSB. LSE-first thinking doesn't save you from mentally reversing things, it means you have to do it when thinking about shifts instead of loads/stores. Since x86 is little-endian, you sometimes have to be thinking about this anyway.


In MSE-first thinking about vectors, just remember that memory order is right to left. When you need to think about overlapping unaligned loads from a block of memory, you can draw the memory contents in right-to-left order, so you can look at vector-length windows of it.

In a text editor, it's no problem to add new text at the left hand side of something and have the existing text displaced to the right, so adding more elements to a comment isn't a problem.

Two major downsides to MSE-first notation are:

  • harder to type the alphabet backwards (like h g f e | d c b a for an AVX vector of 32-bit elements), so I sometimes just start from the right and type a, left-arrow, b, space, ctrl-left arrow, c, space, ... or something like that.

  • Opposite from C array-initializer order. Normally not a problem, because _mm_set_epi* uses MSE-first order. (Use _mm_setr_epi* to match LSE-first comments).


An example where MSE-first is nice is when trying to design a lane-crossing version of 256b vpalignr: See my answer on that questionHow to concatenate two vector efficiently using AVX2?. That includes design-notes in MSE-first notation.

As another example, consider implementing a variable-count byte-shift across a whole vector. You could make a table of pshufb control vectors, but that would be a massive waste of cache footprint. Much better to load a sliding window from memory:

/*  Example of using MSE notation for memory as well as vectors

// 4-element vectors to keep the design notes compact
// I started by just writing down a couple rows of this, then noticing which way they lined up
<< 3:                       00 FF FF FF
<< 1:                 02 01 00 FF
   0:              03 02 01 00
>> 2:        FF FF 03 02
>> 3:     FF FF FF 03
>> 4:  FF FF FF FF

       FF FF FF FF 03 02 01 00 FF FF FF FF
  highest address                       lowest address
*/

#include <immintrin.h>
#include <stdint.h>
// positive counts are right shifts, negative counts are left
// a left-only or right-only implementation would only have one side of the table,
// and only need 32B alignment for the constant in memory to prevent cache-line splits.
__m128i vshift(__m128i v, intptr_t bytes_right)
{   // intptr_t means the caller has to sign-extend it to the width of a pointer, saving a movsx in the non-inline version

   // C11 uses _Alignas, C++11 uses alignas
    _Alignas(64) static const int32_t shuffles[] = {
        -1, -1, -1, -1,
        0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
        -1, -1, -1, -1
    };  // compact but messy with a mix of ordering :/
    const char *identity_shuffle = 16 + (const char*)shuffles;  // points to the middle 16B

    //  count &= 0xf;  tricky to efficiently limit the count while still allowing >>16 to zero the vector, and to allow negative.
    __m128i control = _mm_load_si128((const __m128i*) (identity_shuffle + bytes_right));
    return _mm_shuffle_epi8(v, control);
}

This is kind of the worst-case for MSE-first, because right-shifts take a window from farther left. In LSE-first notation, it might look more natural. Still, unless I got something backwards :P, I think it shows that you can successfully use MSE-first notation even for something you'd expect to be tricky. It didn't feel mind-bending or over-complicated. I just started writing down shuffle control vectors and then lined them up. I could have made it slightly simpler when translating to a C array if I'd used uint8_t shuffles[] = { 0xff, 0xff, ..., 0, 1, 2, ..., 0xff };.I haven't tested this, only that it compiles to one instruction:

    vpshufb xmm0, xmm0, xmmword ptr [rdi + vshift.shuffles+16]
    ret


MSE lets you notice more easily when you can use a bit-shift instead of a shuffle instruction, to reduce pressure on port 5. e.g. psllq xmm, 16/_mm_slli_epi64(v,16) to shift word elements left by one (with zeroing at qword boundaries). Or when you need to shift byte elements, but the only available shifts are 16-bit or wider. The narrowest variable-per-element shifts are 32-bit elements (vpsllvd).

MSE makes it easy to get the shuffle constant right when using larger or smaller granularity shuffles or blends, e.g. pshufd when you can keep pairs of word elements together, or pshufb to shuffle words across the whole vector (because pshuflw/hw is limited).

_MM_SHUFFLE(d,c,b,a) goes in MSE order, too. So does any other way of writing it as a single integer, like C++14 0b11'10'01'00 or 0xE4 (the identity shuffle). Using LSE-first notation will make your shuffle constants look "backwards" relative to your comments. (except for pshufb constants, which you can write with _mm_setr)

这篇关于显示向量寄存器的约定的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

05-30 23:40