Question
I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:
#include <inttypes.h>
#include <stdlib.h>
#include <sys/mman.h>
int main()
{
    uint32_t sum = 0;
    uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    uint16_t *p = (buffer + 1);
    int i;
    for (i=0;i<14;++i) {
        //printf("%d\n", i);
        sum += p[i];
    }
    return sum;
}
This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.
If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.
Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?
Answer
Related: Pascal Cuoq's blog post shows a case where GCC assumes aligned pointers (that two int* don't partially overlap): GCC always assumes aligned pointer accesses. He also links to a 2016 blog post (A bug story: data alignment on x86) that has the exact same bug as this question: auto-vectorization with a misaligned pointer -> segfault.
gcc4.8 makes a loop prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.
I don't think gcc ever intended to support misaligned pointers on x86; it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t) = 2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc, where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals and shouldn't be taken as an indication of "support".
Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which, as you say, on x86 don't have any alignment requirements).
gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.
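That alignment claim is easy to check at runtime. A minimal sketch (the helper name is my own); on glibc/x86-64, _Alignof(max_align_t) is 16:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Sketch: verify that malloc's result is aligned to _Alignof(max_align_t)
// (16 bytes on x86-64 System V). Helper name is hypothetical.
static int malloc_is_maxaligned(size_t n)
{
    void *p = malloc(n);
    int ok = p != NULL && ((uintptr_t)p % _Alignof(max_align_t)) == 0;
    free(p);
    return ok;
}
```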
But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy, which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.
(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except that it wouldn't work, because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)
SSE2 movdqa segfaults on unaligned addresses, and your elements are themselves misaligned, so you have the unusual situation where no array element starts at a 16-byte boundary.
SSE2 is baseline for x86-64, so gcc uses it.
Ubuntu 14.04LTS uses gcc4.8.2. (Off topic: it's old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)
14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.
I put your code on Godbolt, and this is the relevant part of main:
call mmap #
lea rdi, [rax+1] # p,
mov rdx, rax # buffer,
mov rax, rdi # D.2507, p
and eax, 15 # D.2507,
shr rax ##### rax>>=1 discards the low bit, assuming it's zero
neg rax # D.2507
mov esi, eax # prolog_loop_niters.7, D.2507
and esi, 7 # prolog_loop_niters.7,
je .L2
# .L2 leads directly to a MOVDQA xmm2, [rdx+1]
It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop, i.e. gcc doesn't have a code path to handle the case where p is odd.
But the code-gen for malloc looks like this:
call malloc #
movzx edx, WORD PTR [rax+17] # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
movzx ecx, WORD PTR [rax+27] # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
movdqu xmm2, XMMWORD PTR [rax+1] # tmp91, MEM[(uint16_t *)buffer_5 + 1B]
Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done with SIMD, and the remaining 6 with scalar. This is a missed optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.
(There are various other missed optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)
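To illustrate the pmaddwd idea, here's a minimal intrinsics sketch (my own, not gcc's output): multiplying by 1 with _mm_madd_epi16 sums adjacent 16-bit elements into 32-bit lanes. Note that pmaddwd treats its inputs as signed, so this is only correct for values below 0x8000.

```c
#include <assert.h>
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

// Sum eight uint16_t values (each < 0x8000) from a pointer of any alignment.
static uint32_t sum8_u16(const uint16_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);       // movdqu: no alignment requirement
    __m128i pairs = _mm_madd_epi16(v, _mm_set1_epi16(1));  // pmaddwd: four dword pair-sums
    // Horizontal sum of the four dwords:
    __m128i s = _mm_add_epi32(pairs, _mm_unpackhi_epi64(pairs, pairs));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 1));
    return (uint32_t)_mm_cvtsi128_si32(s);
}
```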
If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries and will use unaligned loads.
memcpy is how you express an unaligned load / store in ISO C / C++.
#include <string.h>

int sum(int *p) {
    int sum=0;
    for (int i=0 ; i<10001 ; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];     // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}
With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar code until an alignment boundary, then a vector loop: (Godbolt compiler explorer)
.L4: # gcc7.2 normal dereference
add eax, 1
paddd xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp ecx, eax
ja .L4
But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:
.L2: # gcc7.2 memcpy for an unaligned pointer
movdqu xmm2, XMMWORD PTR [rdi]
add rdi, 16
cmp rax, rdi # end_pointer != pointer
paddd xmm0, xmm2
jne .L2 # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(
# hsum into EAX, then the final odd scalar element:
add eax, DWORD PTR [rdi+40000] # this is how memcpy compiles for normal scalar code, too.
In the OP's case, simply arranging for pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (or for vectorized code the way gcc does it). It doesn't cost a lot of extra memory or space, and the data layout in memory isn't fixed.
But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type, i.e. it becomes just a load or store, with no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.
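As a concrete form of that idiom, the usual pattern is a pair of tiny wrappers (the names are mine); with gcc/clang at -O1 or higher, each compiles to a single mov on x86:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Unaligned scalar load/store expressed via memcpy; fully inlines on gcc/clang.
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   // becomes one unaligned load on x86
    return v;
}

static inline void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);   // becomes one unaligned store on x86
}
```

A round trip through an odd address works regardless of alignment, because the compiler is never told the pointer is aligned.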
Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.
uint64_t tmp=0; and then memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types, for example.
Instead of memcpy (which won't inline on some ISAs when GCC doesn't know the pointer is aligned, i.e. exactly this use-case), you can also use a typedef with a GCC attribute to make an under-aligned version of a type.
typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;
typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;
Related: Why does glibc's strlen need to be so complicated to run quickly? shows how to make a word-at-a-time bithack C strlen safe with this.
Note that ICC doesn't seem to respect __attribute__((may_alias)), but gcc/clang do. I was recently playing around with that, trying to write a portable and safe 4-byte SIMD load like _mm_loadu_si32 (which GCC is missing). https://godbolt.org/z/ydMLCK has various combinations: safe everywhere but with inefficient code-gen on some compilers, or unsafe on ICC but good everywhere else.
aligned(1) may be less bad than memcpy on ISAs like MIPS, where unaligned loads can't be done in one instruction.
You use it like any other pointer.
unaligned_aliasing_int *p = something;
int tmp = *p++;
int tmp2 = *p++;
And of course you can index it as normal, like p[i].
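Putting that together with the OP's scenario, here is a sketch (my own, with a hypothetical typedef name) of the question's loop made safe with the under-aligned typedef; if gcc vectorizes it, it has to use unaligned loads:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t __attribute__((aligned(1), may_alias)) unaligned_u16;

// Safe version of the question's loop: read 14 uint16_t starting at an
// odd address. The attribute tells gcc the pointer may be byte-aligned,
// so it can't emit movdqa on it.
static uint32_t sum14_at_odd_offset(const uint8_t *buffer)
{
    const unaligned_u16 *p = (const unaligned_u16 *)(buffer + 1);
    uint32_t sum = 0;
    for (int i = 0; i < 14; ++i)
        sum += p[i];
    return sum;
}
```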