问题描述
如果允许高性能算法中的许多方法读取输入缓冲区末尾的少量内容,则它们可以(并且正在)简化.这里,少量"指的是通常意味着最多 W - 1
个字节,其中 W
是算法的字长(例如,处理输入的算法最多 7 个字节)在 64 位块中).
Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1
bytes past the end, where W
is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
很明显,写入超过输入缓冲区的末尾永远不会安全,一般来说,因为您可能会破坏缓冲区之外的数据.很明显,越过缓冲区的末尾读取到另一个页面可能会触发分段错误/访问冲突,因为下一页可能无法读取.
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
然而,在读取对齐值的特殊情况下,页面错误似乎是不可能的,至少在 x86 上是这样.在该平台上,页面(以及内存保护标志)具有 4K 粒度(更大的页面,例如 2MiB 或 1GiB,是可能的,但这些是 4K 的倍数),因此对齐读取将仅访问与有效页面相同的页面中的字节缓冲区的一部分.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
以下是一些循环的规范示例,该循环对齐其输入并读取超过缓冲区末尾的 7 个字节:
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
int processBytes(uint8_t *input, size_t size) {
uint64_t *input64 = (uint64_t *)input, end64 = (uint64_t *)(input + size);
int res;
if (size < 8) {
// special case for short inputs that we aren't concerned with here
return shortMethod();
}
// check the first 8 bytes
if ((res = match(*input)) >= 0) {
return input + res;
}
// align pointer to the next 8-byte boundary
input64 = (ptrdiff_t)(input64 + 1) & ~0x7;
for (; input64 < end64; input64++) {
if ((res = match(*input64)) > 0) {
return input + res < input + size ? input + res : -1;
}
}
return -1;
}
内部函数 int match(uint64_t bytes)
未显示,但它会查找与特定模式匹配的字节,并返回最低的此类位置 (0-7),如果找到或 -1 否则.
The inner function int match(uint64_t bytes)
isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
首先,大小小于为简单说明,8 个函数被典当为另一个函数.然后对前 8 个(未对齐的字节)进行一次检查.然后对剩余的 floor((size - 7)/8)
8 个字节的块 进行循环.这个循环最多可以读取超过缓冲区末尾的 7 个字节(当 input & 0xF == 1
时发生 7 字节的情况).然而,返回调用有一个检查,它排除了超出缓冲区末尾的任何虚假匹配.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned bytes). Then a loop is done for the remaining floor((size - 7) / 8)
chunks of 8 bytes. This loop may read up to 7 bytes past the end of the buffer (the 7 byte case occurs when input & 0xF == 1
). However, return call has a check which excludes any spurious matches which occur beyond the end of the buffer.
实际上,这样的函数在 x86 和 x86-64 上安全吗?
这些类型的重读在高性能代码中很常见.避免这种重读的特殊尾码也很常见.有时您会看到后一种类型取代了前一种,以使 valgrind 等工具静音.有时您会看到一个建议进行这样的替换,但由于习语是安全的并且工具有错误(或过于保守)而被拒绝.
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative).
给语言律师的注意事项:
A note for language lawyers:
绝对不允许读取超出其分配大小的指针在标准中.我很欣赏语言律师的回答,甚至偶尔会写他们自己,当有人挖出这一章时,我什至会很高兴和显示上面代码的诗是未定义的行为,因此严格意义上来说并不安全(我将在此处复制详细信息).最终,这不是什么我在追实际上,许多涉及指针的常见习语通过这样的指针进行转换、结构访问等技术上未定义,但在高质量和高性能代码.通常没有替代方案,或者替代方案以半速或更低速度运行.
如果您愿意,请考虑此问题的修改版本,即:
If you wish, consider a modified version of this question, which is:
以上代码编译为 x86/x86-64 程序集后,并且用户已验证它是按预期方式编译的(即编译器没有使用可证明的部分越界访问做一些真的聪明的,执行编译后的程序安全吗?
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e.,the compiler hasn't used a provable partially out-of-bounds access todo something reallyclever,is executing the compiled program safe?
在这方面,这个问题既是一个 C 问题,也是一个 x86 汇编问题.我见过的大多数使用这个技巧的代码都是用 C 编写的,而 C 仍然是高性能库的主要语言,很容易超越像 asm 这样的低级东西,以及像 <everything else> 这样的高级东西.至少在 FORTRAN 仍然打球的硬核数字利基之外.所以我对这个问题的 C-compiler-and-below 观点很感兴趣,这就是为什么我没有将它表述为一个纯粹的 x86 汇编问题.
In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
说了这么多,虽然我只是对指向显示这是 UD 的标准,我对任何细节都很感兴趣可以使用此特定 UD 生成的实际实现意外的代码.现在我不认为这会在没有一些深度的情况下发生相当深入的跨过程分析,但 gcc 溢出的东西也让很多人感到惊讶...
All that said, while I am only moderately interested in a link to thestandard showing this is UD, I am very interested in any details ofactual implementations that can use this particular UD to produceunexpected code. Now I don't think this can happen without some deeppretty deep cross-procedure analysis, but the gcc overflow stuffsurprised a lot of people too...
即使在明显无害的情况下,例如,写回相同的值,它也可以破坏并发代码.
Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
注意此重叠工作需要此函数和 match()
函数以特定的幂等方式运行 - 特别是返回值支持重叠检查.所以找到第一个字节匹配模式"有效,因为所有 match()
调用仍然是有序的.计数字节匹配模式"但是,方法将不起作用,因为某些字节可能会被重复计算.顺便说一句:一些函数,例如返回最小字节".即使没有顺序限制,调用也能工作,但需要检查所有字节.
Note for this overlapping to work requires that this function and match()
function to behave in a specific idempotent way - in particular that the return value supports overlapping checks. So a "find first byte matching pattern" works since all the match()
calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions such as "return the minimum byte" call would work even without the in-order restriction, but need to examine all bytes.
这里值得注意的是,对于 valgrind 的 Memcheck 有一个标志,--partial-loads-ok
用于控制此类读取实际上是否报告为错误.默认为 yes,这意味着通常不会将此类加载视为立即错误,但会努力跟踪加载字节的后续使用,其中一些是有效的,一些是不,如果超出范围的字节被使用,则会标记错误.在上述示例中,在 match()
中访问整个单词的情况下,即使结果最终被丢弃,这种分析也会得出字节被访问的结论.Valgrind 一般无法确定是否实际使用了部分加载中的无效字节(通常检测可能非常.
It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok
which controls whether such reads are in fact reported as an error. The default is yes, means that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match()
, such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
推荐答案
是的,它在 x86 asm 和 现有的 libc 中是安全的 strlen(3)
实现在手写 asm 中利用了这一点. 甚至 glibc 的后备 C,但它编译没有 LTO 所以它永远不能内联.它基本上是使用 C 作为便携式汇编程序来为一个函数创建机器代码,而不是作为具有内联的较大 C 程序的一部分.但这主要是因为它也有潜在的严格别名 UB,请参阅我在链接问答中的回答.您可能还需要 GNU C __attribute__((may_alias))
typedef 而不是普通的 unsigned long
作为更广泛的类型,例如 __m128i
等已经使用.
Yes, it's safe in x86 asm, and existing libc strlen(3)
implementations take advantage of this in hand-written asm. And even glibc's fallback C, but it compiles without LTO so it it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB, see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias))
typedef instead of plain unsigned long
as your wider type, like __m128i
etc. already use.
这是安全的,因为对齐的加载永远不会跨越更高的对齐边界,并且内存保护发生在对齐的页面上,因此至少有 4k 个边界接触至少 1 个有效字节的任何自然对齐的加载都不会出错. 检查您是否离下一页边界足够远以执行 16 字节加载也是安全的,例如if (p & 4095 > (4096 - 16)) do_special_case_fallback
.有关更多详细信息,请参阅以下部分.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundariesAny naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check if you're far enough from the next page boundary to do a 16-byte load, like if (p & 4095 > (4096 - 16)) do_special_case_fallback
. See the section below about that for more detail.
据我所知,在为 x86 编译的 C 中通常也是安全的.在对象外读取当然是 C 中的未定义行为,但在 C-targeting-x86 中有效.我不认为编译器明确/故意定义行为,但实际上它是这样工作的.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
我认为激进的编译器不会假设在优化时不会发生,但是编译器编写者在这一点上的确认会很好,特别是对于在编译时很容易证明访问超出了一个对象的结尾.(请参阅@RossRidge 评论中的讨论:此答案的先前版本断言它是绝对安全的,但 LLVM 博客文章并不是这样读的).
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes out of past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
这是要求在 asm 中一次处理一个隐式长度字符串的速度超过 1 个字节.在 C 中,理论上编译器可以知道如何优化这样的循环,但实际上他们不知道,所以你必须像这样进行 hack.在此更改之前,我怀疑人们关心的编译器通常会避免破坏包含此潜在 UB 的代码.
This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
当知道对象有多长的代码看不到重读时,就不会有危险.编译器必须使 asm 适用于我们实际读取的数组元素的情况.对于未来可能的编译器,我可以看到的似是而非的危险是: 内联后,编译器可能会看到 UB 并决定绝不能采用这条执行路径.或者必须在最终的非完整向量之前找到终止条件,并在完全展开时将其排除.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.
你得到的数据是不可预测的垃圾,但不会有任何其他潜在的副作用.只要您的程序不受垃圾字节的影响,就可以.(例如,使用 bithacks 来查找 uint64_t代码>为零
,然后一个字节循环来查找第一个零字节,不管它后面是什么垃圾.)
The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as the your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find if one of the bytes of a uint64_t
are zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
硬件数据断点(观察点)给定地址.如果您在数组之后立即监视变量,则可能会出现虚假命中.对于调试正常程序的人来说,这可能是一个小烦恼.如果您的函数将成为使用 x86 调试寄存器 D0-D3 的程序的一部分,以及由此产生的可能影响正确性的异常,那么请注意这一点.
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers D0-D3 and the resulting exceptions for something that could affect correctness, then be careful with this.
或者类似地,像 valgrind 这样的代码检查器可能会抱怨读取对象之外的内容.
Or similarly a code checker like valgrind could complain about reading outside an object.
在使用分段的假设 16 位或 32 位操作系统下:分段限制可以使用 4k 或 1 字节粒度,因此可以创建第一个错误偏移为奇数的段.(除了性能之外,将段的基址与缓存行或页面对齐是无关紧要的).所有主流 x86 操作系统都使用平面内存模型,并且 x86-64 取消了对 64 位模式的段限制的支持.
Under a hypothetical 16 or 32-bit OS could that uses segmentation: A segment limit can use 4k or 1-byte granularity so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance). All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
紧跟在缓冲区之后的内存映射 I/O 寄存器您希望以宽负载循环,尤其是相同的 64B 缓存行.即使您从设备驱动程序(或像 X 服务器这样映射了一些 MMIO 空间的用户空间程序)调用这样的函数,这也是极不可能的.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
如果您正在处理一个 60 字节的缓冲区并且需要避免从 4 字节的 MMIO 寄存器中读取数据,那么您就会知道这一点并且将使用 volatile T*
.普通代码不会出现这种情况.
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*
. This sort of situation doesn't happen for normal code.
strlen
是处理隐式长度缓冲区的循环的典型示例,因此无法在不读取缓冲区末尾的情况下进行矢量化.如果您需要避免读取超过终止的 0
字节,则一次只能读取一个字节.
strlen
is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0
byte, you can only read one byte at a time.
例如,glibc 的实现使用序言来处理直到第一个 64B 对齐边界的数据.然后在主循环 (gitweb 链接到 asm 源),它使用四个 SSE2 对齐加载加载整个 64B 缓存线.它使用 pminub
(最小无符号字节),因此只有当四个向量中的任何一个具有零时,最终向量才会具有零元素.在发现字符串的末尾在该缓存行的某个位置后,它会分别重新检查四个向量中的每一个以查看位置.(使用典型的 pcmpeqb
全零向量,和 pmovmskb
/bsf代码>
来找到向量中的位置.)glibc 过去有几个不同的 strlen 策略可供选择,但目前的策略适用于所有 x86-64 CPU.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub
(min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb
against a vector of all-zero, and pmovmskb
/ bsf
to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
出于性能原因,像这样的循环通常会避免触及任何他们不需要触及的额外缓存行,而不仅仅是页面,例如 glibc 的 strlen.
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
一次加载 64B 当然只对 64B 对齐的指针是安全的,因为自然对齐的访问不能跨越 缓存行或页行边界.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page-line boundaries.
如果您提前知道缓冲区的长度,则可以通过使用以最后一个字节结束的未对齐加载来处理最后一个完全对齐向量之外的字节,从而避免读到末尾的缓冲区.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(同样,这仅适用于幂等算法,例如 memcpy,它们不关心它们是否将存储重叠到目的地.就地修改算法通常无法做到这一点,除非使用诸如 ,其中可以重新处理已经向上转换的数据.如果您执行与上次对齐的存储重叠的未对齐加载,则除了存储转发停顿.)
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
因此,如果您在已知长度的缓冲区上进行矢量化,通常最好避免过度读取.
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
对象的无故障重读是一种 UB,如果编译器在编译时看不到它,它绝对不会受到伤害.生成的 asm 将像额外的字节是某个对象的一部分一样工作.
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
但即使它在编译时可见,它通常也不会受到当前编译器的影响.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS:此答案的先前版本声称 int *
的未对齐 deref 在为 x86 编译的 C 中也是安全的.不是真的.3年前写那部分时,我有点太傲慢了.您需要一个 __attribute__((aligned(1)))
typedef 或 memcpy
,以确保安全.
PS: a previous version of this answer claimed that unaligned deref of int *
was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a __attribute__((aligned(1)))
typedef, or memcpy
, to make that safe.
ISO C 未定义但英特尔内在函数要求编译器定义的一组事物确实包括创建未对齐的指针(至少对于像 __m128i*
这样的类型),但不直接取消引用它们.是否在硬件 SIMD 向量指针之间进行`reinterpret_cast`而对应的类型是未定义的行为?
The set of things ISO C leaves undefined but that Intel intrinsics requires compilers to define does include creating unaligned pointers (at least with types like __m128i*
), but not dereferencing them directly. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
这对 strlen 的第一个向量很有用;在此之后你可以 p = (p+16) &-16
转到下一个对齐的向量.如果 p
不是 16 字节对齐,这将部分重叠,但有时进行冗余工作是设置高效循环的最紧凑方式.避免它可能意味着一次循环 1 个字节直到对齐边界,这当然更糟.
This is useful for the first vector of strlen; after this you can p = (p+16) & -16
to go to the next aligned vector. This will partially overlap if p
was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
例如检查 ((p + 15) ^ p) &0xFFF...F000 == 0 (LEA/XOR/TEST) 它告诉您 16 字节加载的最后一个字节与第一个字节具有相同的页面地址位.或 p+15 <= p|0xFFF
(LEA/OR/CMP 具有更好的 ILP)检查加载的最后一个字节地址是
e.g. check ((p + 15) ^ p) & 0xFFF...F000 == 0
(LEA / XOR / TEST) which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= p|0xFFF
(LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
或者更简单地说,p &4095 >(4096 - 16)
(MOV/AND/CMP),即 p &(pgsize-1) 检查页内偏移距页尾是否足够远.
Or more simply, p & 4095 > (4096 - 16)
(MOV / AND / CMP), i.e. p & (pgsize-1) < (pgsize - vecwidth)
checks that the offset-within-page is far enough from the end of a page.
您可以使用 32 位操作数大小来保存此检查或任何其他检查的代码大小(REX 前缀),因为高位无关紧要.一些编译器不会注意到这种优化,因此您可以强制转换为 unsigned int
而不是 uintptr_t
,尽管为了消除有关不是 64 位干净的代码的警告,您可能会需要转换 (unsigned)(uintptr_t)p
.使用 ((unsigned int)p < 可以进一步节省代码大小.((4096 - vectorlen) << 20)
(MOV/SHL/CMP),因为 shl reg, 20
是 3 个字节,而 和 eax, imm32
为 5,或者任何其他寄存器为 6.(使用 EAX 也将允许 cmp eax, 0xfff
的 no-modrm 短格式.)
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int
instead of uintptr_t
, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p
. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20)
(MOV / SHL / CMP), because shl reg, 20
is 3 bytes, vs. and eax, imm32
being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff
.)
如果在 GNU C 中执行此操作,您可能希望 typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias));
使其安全地进行未对齐的访问.
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias));
to make it safe to do unaligned accesses.
这篇关于在 x86 和 x64 上的同一页面内读取缓冲区的末尾是否安全?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!