Question
I developed a program in x86-64 assembly which needs to iterate many times through the same operation:
IMUL rdx, 3 # rdx is always different
However, I need to make the runtime faster, so I thought of an optimization for that specific line from above:
MOV rcx, rdx
SHL rdx, 1
ADD rdx, rcx
Now I ask you guys: would this modification improve the runtime of the program (fewer clock cycles), or should I stick with the IMUL instruction?
Answer
Both are terrible compared to lea rdx, [rdx + rdx*2], using a scaled-index addressing mode to get a total of *3, which is why compilers will always use LEA if you ask them to compile a function like
long foo(long x){ return x * 3;}
(https://godbolt.org/z/6p4ynV)
LEA is a way to feed arbitrary numbers through x86 addressing modes without using the result for a load or store, just putting it in a register. (See also: Using LEA on values that aren't addresses / pointers?)
On all modern x86 CPUs, LEA is a single uop. The only question is how much better it is than the alternatives. imul is also 1 uop, but mov+shl+add is 3 uops for the front-end. (This is true across all mainstream and low-power Intel/AMD CPUs that are still relevant; see https://agner.org/optimize/.) 64-bit imul is extra slow on some older microarchitectures, like Bulldozer-family and Silvermont/Goldmont, or especially older Atom.
On AMD CPUs (Bulldozer/Ryzen), it has a scaled index, so it's a "complex" LEA with 2-cycle latency (vs. 3 for imul on Ryzen, or much worse on Bulldozer-family, where 64-bit imul is slower and not fully pipelined). On Ryzen this LEA still has 2-per-clock throughput.
On Intel CPUs, it has only 2 components (one +), so it's a "simple" LEA with 1-cycle latency that can run with 2-per-clock throughput. So it costs about the same as one shl instruction, but runs on different ports.
(Or on Ice Lake, 4-per-clock, since they added LEA units to the other 2 integer ALU ports. So it's exactly as cheap as one add on Ice Lake.)
You'd only want mov; shl; sub or add when your multiplier is 2^n ± 1 for n > 3. Then it's worth considering imul as a tradeoff between latency and front-end throughput cost.
By shifting the original register, even CPUs without mov-elimination (before IvyBridge and Ryzen) can run your mov/shl/add sequence with a 2-cycle-latency critical path.
Also related: C++ code for testing the Collatz conjecture faster than hand-written assembly - why? has some details about a problem with *3 vs. optimizing with LEA.
This concludes "x86 multiply by 3: IMUL vs SHL + ADD"; we hope the answer above helps.