Problem description
I'm trying to work with SSE and I've run into some strange behaviour.
I wrote some simple code to compare two strings with SSE intrinsics, ran it, and it worked. But later I realized that in my code one of the pointers is still not aligned, even though I use the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary.
#include <immintrin.h> //SSE/AVX intrinsics
#include <cstdint>     //uintptr_t
#include <cstdio>      //printf

//Compare two different, not overlapping pieces of memory
__attribute__((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
    //Handle the head byte-by-byte until pointer [head_1] is 16-byte aligned
    const char* head_1 = (const char*)src_1;
    const char* head_2 = (const char*)src_2;
    size_t tail_n = 0;
    while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
    {
        if (*head_1 != *head_2)
            return 0;
        head_1++, head_2++, tail_n++;
    }
    //Vectorized part: check equality of memory with SSE4.1 instructions
    //src1 - aligned, src2 - NOT aligned
    const __m128i* src1 = (const __m128i*)head_1;
    const __m128i* src2 = (const __m128i*)head_2;
    const size_t n = (size - tail_n) / 32;
    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        printf("src1 align: %u, src2 align: %u\n",
               (unsigned)((uintptr_t)src1 % 16), (unsigned)((uintptr_t)src2 % 16));
        __m128i mm11 = _mm_load_si128(src1);      //requires 16-byte alignment
        __m128i mm12 = _mm_load_si128(src1 + 1);
        __m128i mm21 = _mm_load_si128(src2);      //src2 may be misaligned!
        __m128i mm22 = _mm_load_si128(src2 + 1);
        __m128i mm1 = _mm_xor_si128(mm11, mm21);  //all-zero where blocks are equal
        __m128i mm2 = _mm_xor_si128(mm12, mm22);
        __m128i mm = _mm_or_si128(mm1, mm2);
        if (!_mm_testz_si128(mm, mm))             //any nonzero bit => mismatch
            return 0;
    }
    //Check the tail with scalar instructions
    const size_t rem = (size - tail_n) % 32;
    const char* tail_1 = (const char*)src1;
    const char* tail_2 = (const char*)src2;
    for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
    {
        if (*tail_1 != *tail_2)
            return 0;
    }
    return 1;
}
I printed the alignment of both pointers: one of them was aligned, but the second wasn't. Yet the program still ran correctly and fast.
Then I created a synthetic test like this:
//printChars128(...) just prints the 16 byte values from a __m128i
//buf is assumed to be a 16-byte-aligned buffer
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1); //deliberately misaligned
for (int i = 0; i < 5; i++, A++, B++)
{
    __m128i A1 = _mm_load_si128(A);
    __m128i B1 = _mm_load_si128(B);
    printChars128(A1);
    printChars128(B1);
}
And it crashes, as expected, on the first iteration, when it tries to load through pointer B.
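(For reference, a quick sketch of mine, not from the original post: the same loop runs without crashing if the misaligned pointer is read with the unaligned-load intrinsic instead.)

for (int i = 0; i < 5; i++, A++, B++)
{
    __m128i A1 = _mm_load_si128(A);  //A is 16-byte aligned
    __m128i B1 = _mm_loadu_si128(B); //B may be misaligned: use the unaligned load
    printChars128(A1);
    printChars128(B1);
}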
An interesting fact: if I switch the target to sse4.2, then my is_equal implementation crashes.
Another interesting fact: if I align the second pointer instead of the first (so the first pointer is unaligned and the second is aligned), then is_equal crashes.
So, my question is: why does the is_equal function work fine with only the first pointer aligned when I enable avx instruction generation?
UPD: This is C++ code. I compile it with MinGW64/g++, gcc version 4.9.2, under Windows, x86.
Compile string:
g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe
Accepted answer
TL:DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands of other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for the specifically-aligned load/store instructions like vmovdqa.
In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]), unaligned 128-bit memory operands will fault, except with the special unaligned load/store instructions (like movdqu / movups).
In the VEX encoding of vector instructions (like vpxor xmm1, xmm0, [src1]), unaligned memory operands don't fault, except with the alignment-required load/store instructions (like vmovdqa or vmovntdq).
The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a specific instruction.
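A minimal sketch of the difference (the function names are mine, not from the question). Compiled for plain SSE, xor_aligned can fold the load into pxor xmm, [p], while xor_unaligned needs a separate movdqu; compiled with -mavx, both can fold into vpxor xmm, xmm, [p], because VEX-encoded memory operands don't require alignment.

#include <immintrin.h>

__m128i xor_aligned(const __m128i* p, __m128i v)
{
    return _mm_xor_si128(v, _mm_load_si128(p));   //load may fold into the xor
}

__m128i xor_unaligned(const void* p, __m128i v)
{
    return _mm_xor_si128(v, _mm_loadu_si128((const __m128i*)p)); //SSE: separate movdqu
}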
The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand of the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code-density reasons, and it also means fewer uops to track in parts of the CPU, thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.
The interpretation of the as-if rule in this case is that it's ok for the program not to fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an unoptimized build but not in an optimized build.)
This is the opposite of the rules for floating-point exceptions, where compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.
Note that since stores can't be folded into memory operands of ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers, even when compiling for an AVX target.
// aligned version:
y = ...; // assume it's in xmm1
x = _mm_load_si128(Aptr); // Aligned pointer
res = _mm_or_si128(y, x);
// unaligned version: the same thing with _mm_loadu_si128(Uptr)
When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.
When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128-bit ones) will use the VEX encoding, so you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means that this actually happens, except when the original value loaded is used multiple times.)
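A sketch of that exception (hypothetical function, not from the question): when the loaded value is reused, the compiler has to keep it in a register, so a separate load instruction (vmovdqa here, since _mm_load_si128 promises alignment) is emitted instead of folding the load into the ALU instruction.

#include <immintrin.h>

__m128i use_twice(const __m128i* p, __m128i a, __m128i b)
{
    __m128i v = _mm_load_si128(p);  //needed by both operations below: can't be folded
    return _mm_or_si128(_mm_and_si128(v, a), _mm_xor_si128(v, b));
}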
Only one operand of an ALU instruction can be a memory operand, so in your case one of the two has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment of the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice versa.
You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest. :( clang-3.8 does a significantly better job.
.L36:
    add     rdi, 32
    add     rsi, 32
    cmp     rdi, rcx
    je      .L9
.L10:
    vmovdqa xmm0, XMMWORD PTR [rdi]          # first arg: loads that will fault on unaligned
    xor     eax, eax
    vpxor   xmm1, xmm0, XMMWORD PTR [rsi]    # second arg: loads that don't care about alignment
    vmovdqa xmm0, XMMWORD PTR [rdi+16]       # first arg
    vpxor   xmm0, xmm0, XMMWORD PTR [rsi+16] # second arg
    vpor    xmm0, xmm1, xmm0
    vptest  xmm0, xmm0
    sete    al                               # generate a boolean in a reg
    test    eax, eax
    jne     .L36                             # then test & branch on it.  /facepalm
Note that your is_equal is memcmp. I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and other ISAs which handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that the glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.
The compiler-generated byte-loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compare some bytes you've already checked.
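A sketch of that suggestion as it could slot into is_equal above (assuming size >= 16), replacing the scalar tail loop: one unaligned 16-byte load ending at the last byte of each buffer; re-comparing a few already-checked bytes is harmless.

const char* last_1 = (const char*)src_1 + size - 16; //last 16 bytes of each src
const char* last_2 = (const char*)src_2 + size - 16;
__m128i t1 = _mm_loadu_si128((const __m128i*)last_1); //unaligned loads: always safe here
__m128i t2 = _mm_loadu_si128((const __m128i*)last_2);
__m128i d  = _mm_xor_si128(t1, t2);
if (!_mm_testz_si128(d, d))  //SSE4.1, like the rest of is_equal
    return 0;
return 1;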
Anyway, definitely benchmark your code against the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.