问题描述
说%edi包含x,而我想仅使用2个连续的Leal指令以37 * x结尾,我将如何处理?
Say %edi contains x and I want to end up with 37*x using only 2 consecutive leal instructions, how would I go about this?
例如,要获得45倍,您就可以做到
For example to get 45x you would do
leal (%edi, %edi, 8), %edi
leal (%edi, %edi, 4), %eax (to be returned)
我一生都无法找出要代替8和4的数字,以使结果(%eax)为37x
I cannot for the life of me figure out what numbers to put in place of the 8 and 4 so that the result (%eax) will be 37x
推荐答案
At -O3
, gcc will emit (Godbolt compiler explorer):
int mul37(int a) { return a*37; }
leal (%rdi,%rdi,8), %eax # eax = a * 9
leal (%rdi,%rax,4), %eax # eax = a + 4*(a*9)
ret
正在使用37 = 9*4 + 1
,不破坏第一个lea
的原始a
值,因此它可以在第二个中同时使用.
That's using 37 = 9*4 + 1
, not destroying the original a
value with the first lea
so it can use both in the 2nd.
尽管您没有发现这个错误,但是您很乐于助人:最近的clang(3.8及更高版本)通常会使用2个lea
指令而不是imul
指令(例如,对于*15
),但是会漏掉此指令一种并使用:
You're in good company in not spotting this one, though: recent clang (3.8 and newer) will normally use 2 lea
instructions instead of an imul
(e.g. for *15
), but it misses this one and uses:
imull $37, %edi, %eax
ret
它确实以与5*4 + 1
相同的方式使用了gcc使用的*21
模式. (clang3.6及更早版本始终使用imul
,除非有单指令替代shl
或lea
)
It does do *21
with the same pattern as gcc uses, as 5*4 + 1
. (clang3.6 and earlier always used imul
unless there was a single-instruction alternative shl
or lea
)
ICC和MSVC也使用imul,但是它们似乎不喜欢使用2条lea
指令,因此imul
在这里是故意的".
ICC and MSVC also use imul, but they don't seem to like using 2 lea
instructions, so the imul
is "on purpose" there.
有关gcc7.2与clang5.0的各种乘数,请参见godbolt链接.尝试gcc -m32 -mtune=pentium
或什至pentium3
来查看gcc当时想要使用多少条指令,这很有趣.尽管P2/P3对于imul r, r, i
具有4个周期的延迟,所以还是有点疯狂.奔腾具有9个周期imul
,并且没有OOO来隐藏延迟,因此尽力避免出现延迟是很有意义的.
See the godbolt link for a variety of multipliers with gcc7.2 vs. clang5.0. It's interesting to try gcc -m32 -mtune=pentium
or even pentium3
to see how many more instructions gcc was wiling to use back then. Although P2/P3 has 4-cycle latency for imul r, r, i
, so that's kinda crazy. Pentium has 9 cycle imul
and no OOO to hide the latency, so it makes sense to try hard to avoid it.
mtune=silvermont
应该只愿意用一条指令替换32位imul
,因为它具有3周期延迟/1c吞吐量倍增,但是解码通常是瓶颈(根据Agner Fog, http://agner.org/optimize/).您甚至可以考虑使用imul $64, %edi, %eax
(或2的其他幂)来代替mov
/shl
,因为imul-immediate是一个复制乘积.
mtune=silvermont
should probably only be willing to replace 32-bit imul
with a single instruction, because it has 3-cycle latency / 1c throughput multiply, but decode is often the bottleneck (according to Agner Fog, http://agner.org/optimize/). You could even consider imul $64, %edi, %eax
(or other powers of 2) instead of mov
/shl
, because imul-immediate is a copy-and-multiply.
具有讽刺意味的是,gcc
错过了* 45
的情况,并使用了imul
,而clang使用了2个lea
.猜猜是时候提交一些缺少优化的错误报告了. 如果 2个LEA优于1个IMUL,则应尽可能使用它们.
Ironically, gcc
misses the * 45
case, and uses imul
, while clang uses 2 lea
s. Guess it's time to file some missed-optimization bug reports. If 2 LEAs are better than 1 IMUL, they should be used wherever possible.
较旧的clang(3.7及更高版本)将使用imul
,除非单个lea
可以解决问题.我还没有查看变更日志,看看他们是否做了基准测试来决定支持延迟而不是吞吐量.
Older clang (3.7 and older) uses imul
unless a single lea
will do the trick. I haven't looked up the changelog to see if they did benchmarks to decide to favour latency over throughput.
相关:在未使用"的值上使用LEA关于地址/指针?关于LEA为什么使用内存操作数语法和机器编码的标准答案,即使它是shift + add指令(并且在大多数现代微体系结构中都在ALU而非AGU上运行)
Related: Using LEA on values that aren't addresses / pointers? canonical answer about why LEA uses memory-operand syntax and machine encoding, even though it's a shift+add instruction (and runs on an ALU, not AGU, in most modern microarchitectures.)
这篇关于如何在x86中仅使用2条连续的leal指令将寄存器乘以37?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!