Even though I am compiling for armv7 only, NEON multiply-accumulate intrinsics appear to be decomposed into separate multiplies and adds.
I've experienced this with several versions of Xcode up to the latest 4.5, with iOS SDKs 5 through 6, and with different optimisation settings, both building through Xcode and directly from the command line.
For instance, building and disassembling some test.cpp containing
#include <arm_neon.h>
float32x4_t test( float32x4_t a, float32x4_t b, float32x4_t c )
{
float32x4_t result = a;
result = vmlaq_f32( result, b, c );
return result;
}
with
clang++ -c -O3 -arch armv7 -o "test.o" test.cpp
otool -arch armv7 -tv test.o
results in
test.o:
(__TEXT,__text) section
__Z4test19__simd128_float32_tS_S_:
00000000 f10d0910 add.w r9, sp, #16 @ 0x10
00000004 46ec mov ip, sp
00000006 ecdc2b04 vldmia ip, {d18-d19}
0000000a ecd90b04 vldmia r9, {d16-d17}
0000000e ff420df0 vmul.f32 q8, q9, q8
00000012 ec432b33 vmov d19, r2, r3
00000016 ec410b32 vmov d18, r0, r1
0000001a ef400de2 vadd.f32 q8, q8, q9
0000001e ec510b30 vmov r0, r1, d16
00000022 ec532b31 vmov r2, r3, d17
00000026 4770 bx lr
instead of the expected use of vmla.f32.
What am I doing wrong, please?
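This is either a bug or a deliberate optimization by llvm-clang. armcc and gcc produce vmla as you expect, but the Cortex-A Series Programmer's Guide v3 says:
20.2.3 Scheduling
In some cases there can be a considerable latency, particularly VMLA multiply-accumulate (five cycles for an integer; seven cycles for a floating-point). Code using these instructions should be optimized to avoid trying to use the result value before it is ready, otherwise a stall will occur. Despite having a few cycles result latency, these instructions do fully pipeline so several operations can be in flight at once.
So it makes sense for llvm-clang to split vmla into a separate multiply and add to fill the pipeline.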
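The scheduling advice in the quoted passage can be illustrated with a rough sketch in portable C++ (no NEON required, so it builds on any host). The helper mla and the two dot-product functions are hypothetical names made up for this example; mla is only a scalar stand-in for one lane of vmlaq_f32. The idea is that a chain of multiply-accumulates into a single accumulator forces each operation to wait for the previous result, while several independent accumulators give a pipelined unit work that does not depend on a result still in flight.

```cpp
#include <cstddef>

// Scalar stand-in for one lane of vmlaq_f32(acc, b, c): acc + b * c.
// Hypothetical helper for illustration only; not part of arm_neon.h.
static inline float mla(float acc, float b, float c) {
    return acc + b * c;
}

// Single accumulator: every step consumes the result of the previous
// step, so each multiply-accumulate must wait for the one before it.
float dot_dependent(const float* x, const float* y, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc = mla(acc, x[i], y[i]);
    }
    return acc;
}

// Four independent accumulators: consecutive operations do not depend
// on each other, so several can be in flight at once.
float dot_independent(const float* x, const float* y, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 = mla(acc0, x[i],     y[i]);
        acc1 = mla(acc1, x[i + 1], y[i + 1]);
        acc2 = mla(acc2, x[i + 2], y[i + 2]);
        acc3 = mla(acc3, x[i + 3], y[i + 3]);
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        acc0 = mla(acc0, x[i], y[i]);
    }
    // Combine the partial sums only once, at the end.
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the second version adds the terms in a different order, its result can differ from the first in the last bits for general floating-point input. Whether it is actually faster on a given core, and whether the compiler then emits vmla or vmul/vadd pairs, is something to check by measuring and disassembling rather than to assume.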