Question
I'm currently reading the book "Computer Systems: A Programmer's Perspective". I've found out that, on the x86-64 architecture, only the first 6 integer parameters are passed to a function in registers. Any further parameters are passed on the stack.
And also, the first up-to-8 FP or vector args are passed in xmm0..7.
Why not use float registers in order to store the next parameters, even when the parameters are not single/double precision variables?
It would be much more efficient (as far as I understood) to store the data in registers, than to store it to memory, and then read it from memory.
Answer
Most functions don't have more than 6 integer parameters, so this is really a corner case. Passing some excess integer params in xmm registers would make the rules for where to find floating point args more complicated, for little to no benefit. Besides, it probably wouldn't make code any faster.
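To make the "rules for where to find args" concrete, here is a minimal sketch in C of how the SysV x86-64 convention hands out locations to scalar arguments. It is a simplified model, not the ABI's full classification algorithm: it only knows about integer/pointer and float/double scalars, and the names assign_args and arg_class are made up for this illustration.

```c
#include <stddef.h>

/* Simplified model of SysV x86-64 argument assignment.
   Assumption: scalars only (no structs, long double, or varargs). */
enum arg_class { ARG_INT, ARG_FP };

static const char *const int_regs[6] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};
static const char *const fp_regs[8] = {
    "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
};

/* For each of the n arguments, store the name of the place it is
   passed in ("rdi", "xmm0", ..., or "stack") into out[i]. */
void assign_args(const enum arg_class *args, size_t n, const char **out)
{
    size_t next_int = 0; /* integer registers used so far */
    size_t next_fp = 0;  /* xmm registers used so far */

    for (size_t i = 0; i < n; i++) {
        if (args[i] == ARG_INT && next_int < 6)
            out[i] = int_regs[next_int++];
        else if (args[i] == ARG_FP && next_fp < 8)
            out[i] = fp_regs[next_fp++];
        else
            out[i] = "stack"; /* that class has run out of registers */
    }
}
```

The two counters are independent, so a double in the middle of the list doesn't shift which integer register the next int lands in. Letting surplus integer args spill into xmm registers would couple the two sequences and force every caller and callee to agree on a more involved rule.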
A further reason for storing excess parameters in memory is that the function probably won't use them all right away. If it wants to call another function, it has to save those parameters from xmm registers to memory, because the function it calls will destroy any parameter-passing registers. (And all the xmm regs are caller-saved anyway.) So you could potentially end up with code that stuffs parameters into vector registers where they can't be used directly, and from there stores them to memory before calling another function, and only then loads them back into integer registers. Or even if the function doesn't call other functions, maybe it needs the vector registers for its own use, and would have to store params to memory to free them up for running vector code! It would have been easier just to push params onto the stack, because push is very heavily optimized, for obvious reasons, to do the store and the modification of RSP all in a single uop, about as cheap as a mov.
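As a rough illustration of that point, consider a function with 8 integer args that calls something else before it uses most of them. This is only a sketch: sum8 and scale are hypothetical names, the noinline attribute is a GCC/Clang extension used here to keep the call from being inlined away, and the comments describe what a SysV x86-64 compiler typically has to do rather than guaranteed output.

```c
#include <stdint.h>

/* Hypothetical callee. noinline (GCC/Clang extension) keeps the call real. */
__attribute__((noinline)) int64_t scale(int64_t x)
{
    return x * 3;
}

int64_t sum8(int64_t a, int64_t b, int64_t c, int64_t d,
             int64_t e, int64_t f, int64_t g, int64_t h)
{
    /* On entry: a..f are in rdi, rsi, rdx, rcx, r8, r9; g and h are
       already in memory on the stack. */

    /* The call may clobber every arg-passing register, so b..f have to
       be moved to call-preserved registers or spilled first. g and h
       need no saving at all: they just stay where the caller put them. */
    int64_t t = scale(a);

    return t + b + c + d + e + f + g + h;
}
```

If g and h had arrived in xmm registers instead, they would also have needed saving across the call to scale, and then a move back to integer registers before the adds.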
There is one integer register that is not used for parameter passing, but also not call-preserved in the SysV Linux/Mac x86-64 ABI (r11). It's useful to have a scratch register for lazy dynamic linker code to use without saving (since such shim functions need to pass on all their args to the dynamically-loaded function), and similar wrapper functions.
So AMD64 could have used more integer registers for function parameters, but only at the expense of the number of registers that called functions have to save before using. (Or dual-purpose r10 for languages that don't use a "static chain" pointer, or something.)
Anyway, more parameters passed in registers isn't always better.
xmm registers can't be used as pointer or index registers, and moving data from the xmm registers back to integer registers could slow down the surrounding code more than loading data that was just stored. (If any execution resource is going to be a bottleneck, rather than cache misses or branch mispredicts, it's more likely going to be ALU execution units, not load/store units. Moving data from xmm to gp registers takes an ALU uop, in Intel and AMD's current designs.)
L1 cache is really fast, and store->load forwarding makes the total latency for a round trip to memory something like 5 cycles on e.g. Intel Haswell. (The latency of an instruction like inc dword [mem] is 6 cycles, including the one ALU cycle.)
If moving data from xmm to gp registers was all you were going to do (with nothing else to keep the ALU execution units busy), then yes, on Intel CPUs the round trip latency for movd xmm0, eax / movd eax, xmm0 (2 cycles Intel Haswell) is less than the latency of mov [mem], eax / mov eax, [mem] (5 cycles Intel Haswell), but integer code usually isn't totally bottlenecked by latency the way FP code often is.
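The two round trips can be written out in C if you want to look at them in a compiler's asm output or time them yourself. This is a sketch under some assumptions: roundtrip_xmm and roundtrip_mem are made-up names, the SSE2 intrinsics from emmintrin.h normally map to movd but an optimizing compiler is free to fold the xmm round trip away entirely, and the volatile is only there to force a real store and reload.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* GP -> xmm -> GP. Each direction costs an ALU uop on the CPUs
   discussed above. */
int32_t roundtrip_xmm(int32_t x)
{
#if defined(__SSE2__)
    __m128i v = _mm_cvtsi32_si128(x); /* normally movd xmm, r32 */
    return _mm_cvtsi128_si32(v);      /* normally movd r32, xmm */
#else
    return x; /* no SSE2 available: nothing to demonstrate */
#endif
}

/* GP -> memory -> GP. The reload is normally satisfied by
   store-to-load forwarding, and uses load/store units, not ALU ports. */
int32_t roundtrip_mem(int32_t x)
{
    volatile int32_t slot = x; /* store */
    return slot;               /* reload */
}
```

Both functions return their input unchanged; the only difference is which execution resources the round trip occupies, which is what matters when the surrounding code is throughput-bound rather than latency-bound.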
On AMD Bulldozer-family CPUs, where two integer cores share a vector/FP unit, moving data directly between GP regs and vector regs is actually quite slow (8 or 10 cycles one way, or half that on Steamroller). A memory round trip is only 8 cycles.
32bit code manages to run reasonably well, even though all parameters are passed on the stack, and have to be loaded. CPUs are very highly optimized for storing parameters onto the stack and then loading them again, because the crufty old 32bit ABI is still used for a lot of code, esp. on Windows. (Most Linux systems mostly run 64bit code, while most Windows desktop systems run a lot of 32bit code because so many Windows programs are only available as pre-compiled 32bit binaries.)
See http://agner.org/optimize/ for CPU microarchitecture guides to learn how to figure out how many cycles something will actually take. There are other good links in the x86 wiki, including the x86-64 ABI doc linked above.