Problem description
The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (page 24: integer basic F0/F1, logical F0/F1, execution throughput 2).
However, in our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when plenty of instruction-level parallelism is available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to ensure this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock (NEON integer arithmetic only, no loads/stores).
Are there special measures that have to be taken to get dual-issue throughput when using 128-bit ASIMD/NEON instructions?
Thanks, Clemens
Recommended answer
In real code, not all instruction results are written to the register file; some pass through forwarding paths instead. If you mix dependent and independent instructions in your code, you may see higher IPC.
The A57 optimisation guide states that late forwarding occurs for chains of multiply-accumulate instructions, so something like this may dual-issue:
.loop:
vmla.s16 q0,q0,q1
vmla.s16 q0,q0,q2
vmla.s16 q0,q0,q3
vmla.s16 q4,q4,q1
vmla.s16 q4,q4,q2
vmla.s16 q4,q4,q3
...etc
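For anyone who wants to try the same idea from C, here is a minimal sketch of two independent multiply-accumulate chains. It rests on assumptions that are not in the post above: it uses the GCC/Clang `vector_size` extension rather than hand-written assembly, the helper name `mla_two_chains` and the constants are made up for illustration, and the accumulators are used only as the accumulate operand (in the listing above, q0 is also a multiplicand). A compiler is free to hoist or simplify this loop, so check the generated assembly before drawing any conclusion about dual issue.

```c
#include <stdint.h>

/* 128-bit vector of eight 16-bit lanes (GCC/Clang vector extension).
   Unsigned lanes are used so that wrap-around is well defined. */
typedef uint16_t v8u16 __attribute__((vector_size(16)));

/* Hypothetical helper (not from the original post): runs two independent
   multiply-accumulate chains and returns lane 0 of both accumulators
   added together, so the result can be checked. */
static uint32_t mla_two_chains(int iterations)
{
    v8u16 acc0 = {0};                      /* chain 0 accumulator */
    v8u16 acc1 = {0};                      /* chain 1 accumulator */
    const v8u16 a = {1, 1, 1, 1, 1, 1, 1, 1};
    const v8u16 b = {2, 2, 2, 2, 2, 2, 2, 2};
    const v8u16 c = {3, 3, 3, 3, 3, 3, 3, 3};

    for (int i = 0; i < iterations; ++i) {
        /* Chain 0: each step depends only on the previous acc0. */
        acc0 += a * b;
        acc0 += a * c;
        /* Chain 1: independent of chain 0, so it can overlap with it. */
        acc1 += b * c;
        acc1 += a * b;
    }
    return (uint32_t)acc0[0] + (uint32_t)acc1[0];
}
```

Each iteration adds 5 to every lane of `acc0` and 8 to every lane of `acc1`. A real measurement would need many more chains, a cycle counter, and a check that the compiler actually emitted 128-bit multiply-accumulate instructions.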