Problem description
My understanding of copy-on-write is that "everyone has a single, shared copy of the same data until it's written, and then a copy is made".
- Is the shared copy of the same data made up of the heap and the bss segment, or only the heap?
- Which memory segments will be shared, and does this depend on the OS?
Recommended answer
The OS can set whatever "copy on write" policy it wishes, but generally they all do the same thing (i.e. what makes the most sense).
Loosely, for a POSIX-like system (Linux, BSD, OSX), there are four areas (what you were calling segments) of interest: data (where int x = 1; goes), bss (where int y; goes), sbrk (this is heap/malloc), and stack.
When a fork is done, the OS sets up a new page map for the child that shares all the pages of the parent. Then, in the page maps of both the parent and the child, all the pages are marked read-only (R/O).
Each shared page also has a reference count that indicates how many processes are sharing it. Before the fork, the refcount will be 1 and, after, it will be 2.
Now, when either process tries to write to a R/O page, it gets a page fault. The OS sees that this is for "copy on write", creates a private page for the process, copies in the data from the shared page, marks the page as writable for that process, and resumes it.
It also bumps down the refcount. If the refcount is now [again] 1, the OS marks the page in the other process as writable and non-shared [this eliminates a second page fault in the other process; it is a speedup only because, at this point, the OS knows that the other process should be free to write undisturbed again]. This speedup could be OS dependent.
Actually, the bss section gets even more special treatment. In the initial page mapping for it, all pages are mapped to a single page that contains all zeroes (aka the "zero page"). The mapping is marked R/O. So, the bss area could be gigabytes in size and still occupy only a single physical page. This single, special zero page is shared amongst all bss sections of all processes, regardless of whether they have any relationship to one another at all.
Thus, a process can read from any page in the area and get what it expects: zero. Only when the process tries to write to such a page does the same copy-on-write mechanism kick in: the process gets a private page, the mapping is adjusted, and the process is resumed. It is now free to write to the page as it sees fit.
Once again, an OS can choose its policy. For example, after the fork, it might be more efficient to share most of the stack pages, but start off with private copies of the "current" page, as determined by the value of the stack pointer register.
When an exec syscall is done [on the child], the kernel has to undo much of the mapping done during the fork [bumping down refcounts], releasing the child's mappings, etc., and restoring the parent's original page protections (i.e. the parent will no longer be sharing its pages unless it does another fork).
Although not part of your original question, there are related activities that may be of interest, such as on demand loading [of pages] and on demand linking [of symbols] after an exec syscall.
When a process does an exec, the kernel does the cleanup above and reads a small portion of the executable file to determine its object format. The dominant format is ELF, but any format that a kernel understands can be used (e.g. OSX uses Mach-O rather than ELF).
For ELF, the executable has a special section that gives a full FS path to what's known as the "ELF interpreter", which is a shared library; on x86-64 Linux it is usually /lib64/ld-linux-x86-64.so.2.
The kernel, using an internal form of mmap, will map this into the application space, and set up a mapping for the executable file itself. Most things are marked as R/O pages and "not present".
Before we go further, we need to talk about the "backing store" for a page. That is, if a page fault occurs and we need to load the page from disk, where does it come from? For heap/malloc, this is generally the swap disk [aka paging disk].
Under Linux, it's generally the partition of type "linux swap" that was added when the system was installed. When a page that has been written to has to be flushed to disk to free up some physical memory, it gets written there. Note that the page sharing algorithm in the first section still applies.
Anyway, when an executable is first mapped into memory, its backing store is the executable file in the filesystem.
So, the kernel sets the app's program counter to point to the starting location of the ELF interpreter, and transfers control to it.
The ELF interpreter goes about its business. Every time it tries to execute a portion of itself [a "code" page] that is mapped but not loaded, a page fault occurs and the kernel loads that page from the backing store (e.g. the ELF interpreter's file) and changes the mapping to R/O but present.
This occurs for the ELF interpreter, shared libraries, and the executable itself.
The ELF interpreter will now use mmap to map libc into the app space [again, subject to the demand loading]. If the ELF interpreter has to modify a code page to relocate a symbol [or tries to write to any page that has the file as its backing store, like a data page], a protection fault occurs, the kernel changes the backing store for the page from the on-disk file to a page on the swap disk, adjusts the protections, and resumes the app.
The kernel must also handle the case where the ELF interpreter (for example) tries to write to a data page that has never been loaded (i.e. it has to load the page first and then change the backing store to the swap disk).
The ELF interpreter then uses portions of libc to help it complete initial linking activities. It relocates the minimum necessary to allow it to do its job.
However, the ELF interpreter does not relocate anywhere near all the symbols for most other shared libraries. It will look through the executable and, again using mmap, create a mapping for each of the shared libraries the executable needs (i.e. what you see when you do ldd executable).
These mappings to shared libraries and executables can be thought of as "segments".
There is a symbol jump table in each shared library that points back to the interpreter. But the ELF interpreter makes minimal changes.
[Note: this is a loose explanation] Only when the application tries to call a given function's jump entry [this is that GOT et al. stuff you may have seen] does a relocation occur. The jump entry transfers control to the interpreter, which locates the real address of the symbol, adjusts the GOT so that it now points directly to the final address for the symbol, and redoes the call, which now calls the real function. On a subsequent call to the same function, it goes direct.
This is called "on demand linking".
A by-product of all this mmap activity is that the classical sbrk syscall is of little to no use. It would soon collide with one of the shared library memory mappings.
So, a modern libc does not have to rely on it. When malloc needs more memory from the OS, it can request it with an anonymous mmap and keep track of which allocations belong to which mmap mapping (i.e. if enough memory gets freed to comprise an entire mapping, free could do a munmap). Details vary by implementation: glibc, for example, still grows its main heap with brk where it can and uses mmap for large allocations.
So, to sum up, we have "copy on write", "on demand loading", and "on demand linking" all going on at the same time. It seems complex, but it makes fork and exec go quickly and smoothly, and the extra overhead is incurred only when needed ("on demand").
Thus, instead of a large lurch/delay at the launch of a program, the overhead activity gets spread out over the lifetime of the program, as needed.