Question
I have two tables (arrays) of floats. I need to multiply each element of the first table by the corresponding element of the second table and store the result in a third table.
I would like to use NEON to parallelize the float multiplications: four float multiplications at once instead of one.
I expected a significant speed-up, but I achieved only about a 20% reduction in execution time. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // table size

/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand()/(float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
   - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
        tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two tabs and store results in third tab
   - NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i+=4)
        vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}

int main() {
    float t1[n], t2[n], tr[n];

    /* fill tables with random values */
    srand(1); rand_tab(t1); rand_tab(t2);

    // I repeat table multiplication function 1000000 times for measuring purposes:
    for (int k=0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr); // switch to next line for comparison:
        //mul_tab_neon(t1, t2, tr);

    return 1;
}
I run the following command to compile:
g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas on how I can achieve a more significant speed-up?
Recommended answer
Cortex-A8 and Cortex-A9 can do only two single-precision floating-point multiplications per cycle, so at best you can double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as its x86 one.
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, set -mcpu and -mtune to cortex-a15, cortex-a8 or cortex-a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and unroll even more than you usually would), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports it).