Problem description
The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (page 24: integer basic F0/F1, logical F0/F1, execution throughput 2).
However, in our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when plenty of instruction-level parallelism is available. The benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to provide that parallelism. When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock (NEON integer arithmetic only, no loads/stores).
Are there special measures that have to be taken to get dual-issue throughput when using 128-bit ASIMD/NEON instructions?
Thx, Clemens
Recommended answer
In real code, not all instruction results are written to the register file; many are instead passed along forwarding paths. If you mix dependent and independent instructions in your code, you may see higher IPC.
The A57 optimisation guide states that late forwarding occurs for chains of multiply-accumulate instructions, so maybe something like this will dual-issue:
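.loop
@ first chain: each vmla accumulates into q0
vmla.s16 q0,q0,q1
vmla.s16 q0,q0,q2
vmla.s16 q0,q0,q3
@ second chain: accumulates into q4, independent of the q0 chain
vmla.s16 q4,q4,q1
vmla.s16 q4,q4,q2
vmla.s16 q4,q4,q3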
...etc
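
As a rough way to reason about how many independent chains a loop like this needs before two instructions can issue every clock, here is a minimal Python sketch of a toy issue model. It is not a model of the Cortex-A57: the function name `simulate_ipc`, the greedy in-order pick, and the example latency of 3 cycles are all assumptions made for illustration, and it ignores forwarding paths and any pipeline-specific restrictions, so it cannot explain the 1.0 limit reported in the question.

```python
def simulate_ipc(num_chains, chain_len, latency, issue_width):
    """Toy model (hypothetical, not the A57 pipeline).

    Each chain is a sequence of instructions where every instruction
    depends on the previous one in the same chain. Each cycle, up to
    `issue_width` instructions may issue, and an instruction may issue
    only once its predecessor's result is available `latency` cycles
    after that predecessor issued.
    """
    ready_at = [0] * num_chains           # earliest cycle each chain can issue next
    remaining = [chain_len] * num_chains  # instructions left in each chain
    total = num_chains * chain_len
    issued_total = 0
    cycle = 0
    while issued_total < total:
        issued = 0
        for c in range(num_chains):
            if issued == issue_width:
                break
            if remaining[c] > 0 and ready_at[c] <= cycle:
                remaining[c] -= 1
                ready_at[c] = cycle + latency
                issued += 1
        issued_total += issued
        cycle += 1
    return total / cycle


if __name__ == "__main__":
    # latency=3 and issue_width=2 are illustrative assumptions.
    for chains in (1, 2, 4, 6):
        ipc = simulate_ipc(chains, chain_len=100, latency=3, issue_width=2)
        print(chains, "chains ->", round(ipc, 2), "instructions per clock")
```

In this model, sustained dual issue needs about latency × issue width independent chains (six for the assumed values), which is why a benchmark with only two accumulators could stay well below 2 IPC unless forwarding shortens the effective latency.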