Problem description
I have two tabs (arrays) of floats. I need to multiply each element of the first tab by the corresponding element of the second tab and store the result in a third tab.
I would like to use NEON to parallelize the float multiplications: four float multiplications at a time instead of one.
I expected a significant speed-up, but I achieved only about a 20% reduction in execution time. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // table size

/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand()/(float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
   - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
        tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two tabs and store results in third tab
   - NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i+=4)
        vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}

int main() {
    float t1[n], t2[n], tr[n];

    /* fill tables with random values */
    srand(1); rand_tab(t1); rand_tab(t2);

    // I repeat table multiplication function 1000000 times for measuring purposes:
    for (int k=0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr); // switch to next line for comparison:
        //mul_tab_neon(t1, t2, tr);

    return 1;
}
I run the following command to compile: g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas how I can achieve more significant speed-up?
Recommended answer
Cortex-A8 and Cortex-A9 can do only two single-precision (SP) FP multiplications per cycle, so at most you can double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as the one for x86.
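As an illustration of the unrolling advice, here is a minimal sketch of the question's NEON loop unrolled by four, so each iteration handles 16 floats. The function name mul_tab_neon_unrolled, the explicit count parameter and the scalar fallback are assumptions added for this example and are not part of the original post. Whether this is actually faster depends on the CPU and compiler, so it should be measured rather than assumed.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   // non-NEON builds fall back to the scalar loop below
#endif

// Hypothetical helper (not from the original post):
// out[i] = a[i] * b[i] for i in [0, count).
void mul_tab_neon_unrolled(const float *a, const float *b, float *out,
                           std::size_t count) {
    std::size_t i = 0;
#if HAVE_NEON
    // Main loop: 4 NEON vectors (16 floats) per iteration.
    for (; i + 16 <= count; i += 16) {
        // Issue all loads first so the four multiplies do not depend
        // on each other and can overlap in the pipeline.
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t a2 = vld1q_f32(a + i + 8);
        float32x4_t a3 = vld1q_f32(a + i + 12);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        float32x4_t b2 = vld1q_f32(b + i + 8);
        float32x4_t b3 = vld1q_f32(b + i + 12);
        vst1q_f32(out + i,      vmulq_f32(a0, b0));
        vst1q_f32(out + i + 4,  vmulq_f32(a1, b1));
        vst1q_f32(out + i + 8,  vmulq_f32(a2, b2));
        vst1q_f32(out + i + 12, vmulq_f32(a3, b3));
    }
    // Remaining full vectors, 4 floats at a time.
    for (; i + 4 <= count; i += 4)
        vst1q_f32(out + i, vmulq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    // Scalar tail (and the whole array when NEON is unavailable).
    for (; i < count; ++i)
        out[i] = a[i] * b[i];
}
```

In the question's main, the call would be mul_tab_neon_unrolled(t1, t2, tr, n). The scalar tail also means the table size no longer has to be a multiple of four.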
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, change the -mcpu and -mtune values to cortex-a15, cortex-a8 or cortex-a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually would), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports it).
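To check that options like these really reached the compiler, one possibility is to inspect the macros that gcc and clang predefine. The sketch below is an illustration only: the helper name build_config_summary is hypothetical, and it relies on __ARM_NEON (or the older __ARM_NEON__), __thumb__, __OPTIMIZE__, __FAST_MATH__ and __ARM_ARCH, which are set by -mfpu=neon, -mthumb, -O1 and above, -ffast-math and the target architecture respectively.

```cpp
#include <string>

// Hypothetical helper (not from the original post): reports which build
// options the compiler applied, based on its predefined macros.
std::string build_config_summary() {
    std::string s;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    s += "neon=yes";        // NEON intrinsics are available
#else
    s += "neon=no";
#endif
#if defined(__thumb__)
    s += " thumb=yes";      // compiled with -mthumb
#else
    s += " thumb=no";
#endif
#if defined(__OPTIMIZE__)
    s += " optimize=yes";   // some -O level above -O0 is active
#else
    s += " optimize=no";
#endif
#if defined(__FAST_MATH__)
    s += " fast-math=yes";  // compiled with -ffast-math
#else
    s += " fast-math=no";
#endif
#if defined(__ARM_ARCH)
    s += " arch=" + std::to_string(__ARM_ARCH);  // e.g. 7 for ARMv7-A
#endif
    return s;
}
```

Printing this string at program start shows, for instance, that the command in the question enables NEON and fast math but no optimization level, since it passes no -O option.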