Question
I was going over some assembly code and I saw this:
mov r12, _read_loopr
jmp _bzero
_read_loopr:
...
_bzero:
inc r8
mov byte [r8+r15], 0x0
cmp r8, 0xff
jle _bzero
jmp r12
I was wondering whether there is any particular advantage to doing this (mov the address of _read_loopr into a register, jmp to the function, and then jmp back through the register) rather than the usual call _bzero and ret?
Recommended answer
This just looks like braindead code, especially if the return-address label is always right after the jmp _bzero, like you say in your comment.
Maybe the author thought that they couldn't use call "because function calls clobber registers". That is what you have to assume, based on the calling convention, if you're calling a function that isn't part of the same codebase. But you can call/ret to functions with custom calling conventions.
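As a rough sketch of that point, the same loop can be entered with call and left with ret while still following a private register convention. Everything outside the loop body is an assumption made for illustration: the buf label, the Linux x86-64 exit syscall, the build command, and r8 starting at -1 so that the first store lands at index 0. The loop as written stores through index 0x100, so the buffer is sized for 257 bytes.

```nasm
; Assumed build: nasm -felf64 demo.asm && ld demo.o -o demo   (Linux x86-64)
; Private convention (assumed): r15 = buffer base, r8 = index;
; _bzero modifies only r8 and the flags.
default rel
section .bss
buf:    resb 0x101              ; hypothetical buffer, indices 0..0x100

section .text
global _start
_start:
    lea     r15, [buf]
    mov     r8, -1              ; assumption: the first inc brings the index to 0
    call    _bzero              ; pushes the address of _read_loopr
_read_loopr:
    mov     eax, 60             ; sys_exit, standing in for the rest of the read loop
    xor     edi, edi
    syscall

_bzero:
    inc     r8
    mov     byte [r8+r15], 0x0
    cmp     r8, 0xff
    jle     _bzero
    ret                         ; pairs with the call above
```

No registers are saved or restored around the call; the caller simply knows which registers _bzero touches, which is all a custom convention needs within one codebase.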
Of course, for code this small, it should have been inlined (i.e. make it a macro, not a function).
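A minimal sketch of the macro version, assuming NASM syntax. The macro name ZERO_LOOP, the buf label, the starting value of r8, and the Linux exit syscall are all hypothetical choices for this example. The %%next label is macro-local, so the macro can be expanded more than once in the same file.

```nasm
; Assumed build: nasm -felf64 demo.asm && ld demo.o -o demo   (Linux x86-64)
default rel

; Hypothetical macro: same loop body as _bzero, expanded in place.
; Expects r15 = buffer base and r8 = index before the first byte to zero.
%macro ZERO_LOOP 0
%%next:
    inc     r8
    mov     byte [r8+r15], 0x0
    cmp     r8, 0xff
    jle     %%next
%endmacro

section .bss
buf:    resb 0x101              ; hypothetical buffer, indices 0..0x100

section .text
global _start
_start:
    lea     r15, [buf]
    mov     r8, -1              ; assumption: the first inc brings the index to 0
    ZERO_LOOP                   ; no call, no ret, no indirect jmp
    mov     eax, 60             ; sys_exit, standing in for the rest of the read loop
    xor     edi, edi
    syscall
```

Execution simply falls through from the end of the expanded loop into the code that follows it, so there is no return address to pass around at all.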
More importantly, something more clever than storing one byte at a time is normally possible, and is probably worth a potential branch mispredict if there are more than a few bytes to zero. If at least 8 (or better, 16) bytes of data always need to be zeroed, you can do it with wide stores. Make the final store write the last byte of the buffer to be zeroed, potentially overlapping with the previous store. (This is much better than ending with branches to decide whether to do a final 4B store, 2B store, and 1B store.) See the x86 tag wiki for resources about writing efficient asm.
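The overlapping-final-store idea might look something like the following, assuming SSE2 16-byte stores and a length of at least 16. The zero_wide routine, its register arguments (rdi = start, rsi = byte count), the buf label, and the Linux exit syscall are hypothetical and not part of the original code.

```nasm
; Assumed build: nasm -felf64 demo.asm && ld demo.o -o demo   (Linux x86-64)
default rel
section .bss
buf:    resb 257                ; hypothetical buffer

section .text
global _start
_start:
    lea     rdi, [buf]
    mov     esi, 257            ; byte count, must be >= 16
    call    zero_wide
    mov     eax, 60             ; sys_exit
    xor     edi, edi
    syscall

; zero_wide (hypothetical): zero rsi bytes at rdi, requires rsi >= 16.
; Modifies rax, rdi, xmm0 and the flags.
zero_wide:
    pxor    xmm0, xmm0
    lea     rax, [rdi + rsi - 16]   ; address of the last 16 bytes of the buffer
.loop:
    movdqu  [rdi], xmm0             ; unaligned 16-byte store
    add     rdi, 16
    cmp     rdi, rax
    jb      .loop
    movdqu  [rax], xmm0             ; final store ends exactly at the last byte,
                                    ; possibly overlapping the previous store
    ret
```

The only branch is the loop branch; no tail handling is needed to pick 8-, 4-, 2-, or 1-byte stores, because the last store is positioned relative to the end of the buffer rather than the current pointer.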
If the return address was somewhere other than right after the jmp _bzero, then the worst possible thing would probably be push _read_loopr / jmp _bzero, and ret in _bzero. That would break the return-address predictor stack, leading to a mispredict on the next ~15 rets up the call tree.
Best would be to inline the loop and put a direct jmp after it.
I'm not sure how passing _bzero an address to jmp to would compare with a call/ret and then a jmp after the call.
call/ret are fairly cheap, but not single-uop instructions on Intel. A jmp _bzero / jmp _read_loopr would be better if there was only one caller.