This article looks at multiplying by 3 in x86: IMUL vs. SHL + ADD, and which approach gives faster code.

Question

I developed a program in x86-64 assembly which needs to iterate many times through the same operation:

IMUL rdx, 3   # rdx is always different

However, I need to make the runtime faster, so I thought of an optimization for that specific line from above:

MOV rcx, rdx
SHL rdx, 1
ADD rdx, rcx

Now I ask you guys: would this modification improve the runtime of the program (less clocks), or should I stick with the IMUL command?

Answer

Both are terrible compared to lea rdx, [rdx + rdx*2], using a scaled-index addressing mode to get a total of *3, which is why compilers will always use LEA if you ask them to compile a function like

long foo(long x){ return x * 3;} (https://godbolt.org/z/6p4ynV)
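
For reference, this is roughly what compilers emit for that function (under the x86-64 System V calling convention, x arrives in rdi and the result is returned in rax):

foo:
    lea rax, [rdi + rdi*2]   # rax = rdi*3 in a single instruction / single uop
    ret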

LEA is a way to feed arbitrary numbers through x86 addressing modes without using the result for a load or store, just putting it in a register. (See also: Using LEA on values that aren't addresses / pointers?)

On all modern x86 CPUs, LEA is a single uop. The only question is how much better it is than the alternatives. imul is also 1 uop, but mov+shl+add is 3 for the front-end. (This is true across all mainstream and low-power Intel/AMD that are still relevant. See https://agner.org/optimize/) 64-bit imul is extra slow on some older microarchitectures, like Bulldozer-family and Silvermont/Goldmont, or especially older Atom.

On AMD CPUs (Bulldozer/Ryzen), it has a scaled index so it's a "complex" LEA and has 2 cycle latency (vs. 3 for imul on Ryzen, or much worse on Bulldozer-family where 64-bit imul is slower and not fully pipelined). On Ryzen this LEA still has 2-per-clock throughput.

On Intel CPUs, it only has 2 components (one +), so it's a "simple" LEA with 1 cycle latency and can run with 2-per-clock throughput. So about the same cost as one shl instruction, but runs on different ports.

(Or on Ice Lake, 4-per-clock since they added LEA units to the other 2 integer ALU ports. So it's exactly as cheap as one add on Ice Lake.)
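
For contrast (an assumed example, not from the original answer): adding a displacement on top of the scaled index would make it a 3-component LEA, which Sandybridge-family Intel CPUs treat as a slow "complex" LEA:

lea rdx, [rdx + rdx*2 + 8]   # 3 components: about 3-cycle latency, port 1 only, on Sandybridge-family Intel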

You'd only want mov ; shl ; sub or add when your multiplier was 2^n +- 1 for n > 3. Then it is worth considering imul for a tradeoff between latency and front-end throughput cost.
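
For example, a multiplier of 17 = 2^4 + 1 fits that pattern; a hypothetical sequence (not from the original question) would be:

mov rcx, rdx
shl rdx, 4       # rdx = original rdx * 16
add rdx, rcx     # rdx = original rdx * 17; the 1-uop alternative is imul rdx, rdx, 17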

By shifting the original register, even CPUs without mov-elimination (before IvyBridge and Ryzen) can run your mov/shl/add sequence with 2 cycle latency critical path length.
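
A sketch of why the ordering matters (my illustration, not part of the original answer):

mov rcx, rdx     # the copy runs in parallel with the shift below
shl rdx, 1       # shifts the original rdx; does not wait for the mov
add rdx, rcx     # depends on both, so the critical path is 2 cycles

# Shifting the copy instead would chain mov -> shl -> add,
# a 3-cycle critical path on CPUs without mov-elimination:
mov rcx, rdx
shl rcx, 1
add rdx, rcx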

Also related: C++ code for testing the Collatz conjecture faster than hand-written assembly - why? has some details about a problem with *3 vs. optimizing with LEA.

Other related:

x86_64: Is IMUL faster than 2x SHL + 2x ADD?
