Problem description
I am reading the IDA Pro Book. On page 86 while discussing calling conventions, the author shows an example of cdecl calling convention that eliminates the need for the caller to clean arguments off the stack. I am reproducing the code snippet below:
; demo_cdecl(1, 2, 3, 4); //programmer calls demo_cdecl
mov [esp+12], 4 ; move parameter z to fourth position on stack
mov [esp+8], 3 ; move parameter y to third position on stack
mov [esp+4], 2 ; move parameter x to second position on stack
mov [esp], 1 ; move parameter w to top of stack
call demo_cdecl ; call the function
The author goes on to discuss this example.
I am going to assume that there is a sub esp, 0x10 at the top of the code snippet. Otherwise, you would just be corrupting the stack.
He later says that the caller doesn't need to adjust the stack when the call to demo_cdecl completes. But surely, there has to be an add esp, 0x10 after the call.
What exactly am I missing?
Recommended answer
Compilers often choose mov to store args instead of push, if there's enough space already allocated (e.g. with a sub esp, 0x10 earlier in the function like you suggested).
Here's an example:
int f1(int);
int f2(int,int);
int foo(int a) {
f1(2);
f2(3,4);
return f1(a);
}
Compiled by clang 6.0 -O3 -march=haswell on Godbolt:
sub esp, 12 # reserve space to realign stack by 16
mov dword ptr [esp], 2 # store arg
call f1(int)
# reuse the same arg-passing space for the next function
mov dword ptr [esp + 4], 4
mov dword ptr [esp], 3
call f2(int, int)
add esp, 12
# now ESP is pointing to our own arg
jmp f1(int) # TAILCALL
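To make the ESP bookkeeping in the clang output concrete, here is a minimal C sketch that tracks only the stack pointer. It is a toy model, not an emulator: the function names and the starting ESP value are made up for illustration, stack alignment is ignored in the push-style variant, and it assumes cdecl callees that return with a plain ret, so a call leaves ESP where it found it.

```c
#include <stdint.h>

/* Toy model: ESP is just a number; memory contents are not simulated.
   Assumption: every callee is cdecl and ends with a plain `ret`, so
   `call` (push return address) plus `ret` (pop it) leave ESP unchanged. */
static uint32_t call_cdecl(uint32_t esp) {
    esp -= 4;   /* call pushes the return address */
    esp += 4;   /* the callee's ret pops it again */
    return esp;
}

/* clang-style: reserve once, store args with mov, release once. */
uint32_t sim_clang_style(uint32_t esp) {
    esp -= 12;              /* sub esp, 12 */
    /* mov [esp], 2 : a store, ESP does not move */
    esp = call_cdecl(esp);  /* call f1 */
    /* mov [esp+4], 4 / mov [esp], 3 : stores only */
    esp = call_cdecl(esp);  /* call f2: nothing to clean up afterwards */
    esp += 12;              /* add esp, 12 (once, before the tail call) */
    return esp;
}

/* push-style: each push moves ESP, so the caller undoes it per call. */
uint32_t sim_push_style(uint32_t esp) {
    esp -= 4;               /* push 2 */
    esp = call_cdecl(esp);  /* call f1 */
    esp += 4;               /* add esp, 4 : caller cleans up */
    esp -= 8;               /* push 4 / push 3 */
    esp = call_cdecl(esp);  /* call f2 */
    esp += 8;               /* add esp, 8 : caller cleans up again */
    return esp;
}
```

Both functions hand back the ESP they were given. The difference is where the adjustment happens: in the mov-based version the only add esp is the one that undoes the function's own sub esp, which is presumably what the book means by the caller not needing to adjust the stack after the call.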
clang's code-gen would have been even better with sub esp,8 / push 2, with the rest of the function unchanged. i.e. let push grow the stack, because it has smaller code-size than mov, especially mov-immediate, and performance is not worse (because we're about to call, which also uses the stack engine). See "What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?" for more details.
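As a rough check of the code-size claim, the sketch below adds up 32-bit x86 instruction lengths for the setup of the first call in both styles. The byte counts follow the standard encodings noted in the comments; the enum and function names are hypothetical, and the comparison covers only these few instructions, not the whole function.

```c
/* Encoded lengths in bytes of the 32-bit x86 instructions involved. */
enum {
    LEN_PUSH_IMM8     = 2, /* 6A ib       : push 2                   */
    LEN_MOV_ESP_IMM32 = 7, /* C7 04 24 id : mov dword ptr [esp], 2   */
    LEN_SUB_ESP_IMM8  = 3  /* 83 EC ib    : sub esp, 8 / sub esp, 12 */
};

/* What clang emitted: sub esp, 12 / mov dword ptr [esp], 2 */
int setup_bytes_mov(void) {
    return LEN_SUB_ESP_IMM8 + LEN_MOV_ESP_IMM32;
}

/* The suggested alternative: sub esp, 8 / push 2 */
int setup_bytes_push(void) {
    return LEN_SUB_ESP_IMM8 + LEN_PUSH_IMM8;
}
```

Under these encodings the push version of the setup is 5 bytes against 10. ESP ends up at the same place either way (entry value minus 12, with the argument at [esp]), so the later mov stores and the add esp, 12 would not need to change.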
I also included in the Godbolt link GCC output with/without -maccumulate-outgoing-args, which defers clearing the stack until the end of the function.
By default (without -maccumulate-outgoing-args), gcc does let ESP bounce around, and even uses 2x pop to clear 2 args from the stack. (This avoids a stack-sync uop, at the cost of 2 useless loads that hit in L1d cache.) With 3 or more args to clear, gcc uses add esp, 4*N. I suspect that reusing the arg-passing space with mov stores instead of add esp / push would sometimes be a win for overall performance, especially with registers instead of immediates. (push imm8 is much more compact than mov imm32.)
foo(int): # gcc7.3 -O3 -m32 output
push ebx
sub esp, 20
mov ebx, DWORD PTR [esp+28] # load the arg even though we never need it in a register
push 2 # first function arg
call f1(int)
pop eax
pop edx # clear the stack
push 4
push 3 # and write the next two args
call f2(int, int)
mov DWORD PTR [esp+32], ebx # store `a` back where it already was
add esp, 24
pop ebx
jmp f1(int) # and tailcall
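The gcc listing is easier to follow with ESP written as an offset from its value on entry to foo. The sketch below replays the listing's stack adjustments in C. It is a hand-written trace (the names gcc_trace and struct trace are made up here), again assuming cdecl callees that leave ESP unchanged across call/ret.

```c
/* ESP is tracked as a signed offset from its value at entry to foo.
   At entry, [esp] holds the return address and [esp+4] holds `a`,
   so `a` lives at offset +4 in this coordinate system. */
struct trace {
    int load_a_addr;   /* offset read by    mov ebx, [esp+28] */
    int store_a_addr;  /* offset written by mov [esp+32], ebx */
    int final_esp;     /* offset at the jmp f1 tail call      */
};

struct trace gcc_trace(void) {
    struct trace t;
    int esp = 0;
    esp -= 4;                   /* push ebx */
    esp -= 20;                  /* sub esp, 20 */
    t.load_a_addr = esp + 28;   /* mov ebx, [esp+28] */
    esp -= 4;                   /* push 2 */
    /* call f1: ESP unchanged once it returns (cdecl assumption) */
    esp += 8;                   /* pop eax / pop edx */
    esp -= 8;                   /* push 4 / push 3 */
    /* call f2: ESP unchanged once it returns */
    t.store_a_addr = esp + 32;  /* mov [esp+32], ebx */
    esp += 24;                  /* add esp, 24 */
    esp += 4;                   /* pop ebx */
    t.final_esp = esp;
    return t;
}
```

In this model both accesses to a resolve to offset +4, which matches the listing's comments that the load and the store are redundant, and ESP is back at its entry value when the tail call is made.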
With -maccumulate-outgoing-args, the output is basically like clang's, but gcc still saves/restores ebx and keeps a in it before doing a tailcall.
Note that having ESP bounce around requires extra metadata in .eh_frame for stack unwinding. Jan Hubicka writes in 2014:
So that's a 4% code-size saving (in bytes; this matters for L1i cache footprint) from using push for args and, at least typically, clearing them off the stack after each call. I think there's a happy medium here, where gcc could use more push without using only push/pop.
There's a confounding effect of maintaining 16-byte stack alignment before call, which is required by the current version of the i386 System V ABI. In 32-bit mode, it used to just be a gcc default to maintain -mpreferred-stack-boundary=4 (i.e. 1<<4). I think you can still use -mpreferred-stack-boundary=2 to violate the ABI and make code that only cares about 4B alignment for ESP.
I didn't try this on Godbolt, but you could.
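The alignment rule also accounts for otherwise odd constants such as the sub esp, 12 in the clang output. A small C helper, hypothetical and for illustration only, computes how many bytes of padding are needed so that ESP is a multiple of the boundary at a call instruction, given that the return address already occupies 4 bytes at function entry.

```c
/* Bytes of padding a function must add so that ESP is `boundary`-aligned
   at a call instruction.
   Assumptions: ESP was aligned just before our caller's call, so 4 bytes
   (the return address) are already on the stack at entry; `used` is the
   number of bytes this function has pushed or reserved for saved registers
   and outgoing args; `boundary` is a power of two. */
unsigned pad_for_call(unsigned used, unsigned boundary) {
    unsigned total = 4 + used;  /* return address + our own bytes */
    return (boundary - total % boundary) % boundary;
}
```

With used = 4 (one 4-byte argument slot) and boundary = 16 this gives 8, matching clang's sub esp, 12: 4 bytes for the argument plus 8 of padding. With boundary = 4, which is what -mpreferred-stack-boundary=2 would imply, the padding is 0 whenever everything on the stack is a 4-byte slot.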