How is stack memory allocated when using the "push" or "sub" x86 instructions?

Problem description

I have been browsing for a while and I am trying to understand how memory is allocated to the stack when doing for example:

push rax

Or moving the stack pointer to allocate space for local variables of a subroutine:

sub rsp, X    ;Move stack pointer down by X bytes

What I understand is that the stack segment is anonymous in the virtual memory space, i.e., not file-backed.

What I also understand is that the kernel will not actually map an anonymous virtual memory segment to physical memory until the program actually does something with that memory segment, i.e., writes data. So, trying to read that segment before writing to it may cause an error.

In the first example, the kernel will assign a page frame in physical memory if needed. In the second example, I assume that the kernel will not assign any physical memory to the stack segment until the program actually writes data to an address in the stack segment.

Am I on the right track here?

Solution

Yes, you're pretty much on the right track here. sub rsp, X is kind of like "lazy" allocation: the kernel only does anything after a #PF page fault exception from touching memory above the new RSP, not when you merely modify a register. But you can still consider the memory "allocated", i.e. safe for use.

No, read won't cause an error. Anonymous pages that have never been written are copy-on-write mapped to a/the physical zero page, whether they're in the BSS, stack, or mmap(MAP_ANONYMOUS).

Fun fact: in micro-benchmarks, make sure you touch each page of memory for input arrays, otherwise you're actually looping over the same physical 4k or 2M page of zeros repeatedly and will get L1D cache hits even though you still get TLB misses (and soft page faults)! gcc will optimize malloc+memset(0) to calloc, but std::vector will actually write all the memory whether you want it to or not. memset on global arrays is not optimized out, so that works. (Or non-zero initialized arrays will be file-backed in the data segment.)

Note, I'm leaving out the difference between mapped vs. wired. i.e. whether an access will trigger a soft/minor page fault to update the page tables, or whether it's just a TLB miss and the hardware page-table walk will find a mapping (to the zero page).

But stack memory below RSP may not be mapped at all, so touching it without moving RSP first can be an invalid page fault instead of a "minor" page fault to sort out copy-on-write.

Stack memory has an interesting twist: the stack size limit is typically something like 8MiB by default (ulimit -s), but on Linux the initial stack for the first thread of a process is special. For example, I set a breakpoint at _start in a dynamically linked hello-world executable and looked at its /proc/<PID>/smaps:

7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
Size:                132 kB
Rss:                   8 kB
Pss:                   8 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB
Referenced:            8 kB
Anonymous:             8 kB
...

Only 8KiB of stack has been referenced and is backed by physical pages. That's expected, since the dynamic linker doesn't use a lot of stack.

Only 132KiB of stack is even mapped into the process's virtual address space. But special magic stops mmap(NULL, ...) from randomly choosing pages within the 8MiB of virtual address space that the stack could grow into.

Touching memory below the current stack mapping but within the stack limit causes the kernel to grow the stack mapping (in the page-fault handler).

(But only if rsp is adjusted first; the red-zone is only 128 bytes below rsp, so ulimit -s unlimited doesn't make touching memory 1GB below rsp grow the stack to there, but it will if you decrement rsp to there and then touch memory.)

This only applies to the initial/main thread's stack. pthreads just uses mmap(MAP_ANONYMOUS|MAP_STACK) to map an 8MiB chunk that can't grow. (MAP_STACK is currently a no-op.) So thread stacks can't grow after allocation (except manually with MAP_FIXED if there's space below them), and aren't affected by ulimit -s unlimited.

This magic preventing other things from choosing addresses in the stack-growth region doesn't exist for mmap(MAP_GROWSDOWN), so do not use it to allocate new thread stacks. (Otherwise you could end up with something using up the virtual address space below the new stack, leaving it unable to grow.) Just allocate the full 8MiB. See also the related question "Where are the stacks for the other threads located in a process virtual address space?".

MAP_GROWSDOWN does have a grow-on-demand feature, described in the mmap(2) man page, but there's no growth limit (other than coming close to an existing mapping), so (according to the man page) it's based on a guard-page like Windows uses, not like the primary thread's stack.

Touching memory multiple pages below the bottom of a MAP_GROWSDOWN region might segfault (unlike with Linux's primary-thread stack). Compilers targeting Linux don't generate stack "probes" to make sure each 4k page is touched in order after a big allocation (e.g. local array or alloca), so that's another reason MAP_GROWSDOWN isn't safe for stacks.

Compilers do emit stack probes on Windows.

(MAP_GROWSDOWN might not even work at all, see @BeeOnRope's comment. It was never very safe to use for anything, because stack clash security vulnerabilities were possible if the mapping grows close to something else. So just don't use MAP_GROWSDOWN for anything ever. I'm leaving in the mention to describe the guard-page mechanism Windows uses, because it's interesting to know that Linux's primary-thread stack design isn't the only one possible.)
