NEON float multiplication is slower than expected

Question

I have two arrays ("tabs") of floats. I need to multiply each element of the first tab by the corresponding element of the second tab and store the result in a third tab.

I would like to use NEON to parallelize the float multiplications: four float multiplications at a time instead of one.

I expected a significant speed-up, but I achieved only about a 20% reduction in execution time. This is my code:

#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // table size

/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand()/(float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
 - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
         tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two tabs and store results in third tab
- NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i+=4)
        vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}

int main() {
    float t1[n], t2[n], tr[n];

    /* fill tables with random values */
    srand(1); rand_tab(t1); rand_tab(t2);


    // I repeat table multiplication function 1000000 times for measuring purposes:
    for (int k=0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr);  // switch to next line for comparison:
    //mul_tab_neon(t1, t2, tr);
    return 0;
}

I run the following command to compile: g++ -mfpu=neon -ffast-math neon_test.cpp

My CPU: ARMv7 Processor rev 0 (v7l)

Do you have any ideas how I can achieve more significant speed-up?

Recommended answer

Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle, so you can at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as the one for x86.
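As an illustration of the unrolling advice, here is a minimal sketch of the question's NEON loop unrolled four times (16 floats per iteration). The function name `mul_tab_neon_unrolled`, the explicit length parameter `n`, and the unroll factor are assumptions made for this example, not part of the original code. The intrinsics are guarded so the file also compiles, with a plain scalar loop, on targets without NEON. Whether this particular factor helps on a given core has to be measured.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

// Hypothetical helper: tr[i] = t1[i] * t2[i] for i in [0, n).
// Processes 16 floats per iteration when NEON is available.
void mul_tab_neon_unrolled(const float *t1, const float *t2, float *tr,
                           std::size_t n) {
    std::size_t i = 0;
#if HAVE_NEON
    for (; i + 16 <= n; i += 16) {
        // Issue all loads first so the four multiplies are independent
        // and the core can overlap them with the memory accesses.
        float32x4_t a0 = vld1q_f32(t1 + i);
        float32x4_t a1 = vld1q_f32(t1 + i + 4);
        float32x4_t a2 = vld1q_f32(t1 + i + 8);
        float32x4_t a3 = vld1q_f32(t1 + i + 12);
        float32x4_t b0 = vld1q_f32(t2 + i);
        float32x4_t b1 = vld1q_f32(t2 + i + 4);
        float32x4_t b2 = vld1q_f32(t2 + i + 8);
        float32x4_t b3 = vld1q_f32(t2 + i + 12);
        vst1q_f32(tr + i,      vmulq_f32(a0, b0));
        vst1q_f32(tr + i + 4,  vmulq_f32(a1, b1));
        vst1q_f32(tr + i + 8,  vmulq_f32(a2, b2));
        vst1q_f32(tr + i + 12, vmulq_f32(a3, b3));
    }
#endif
    // Scalar tail: handles the leftover elements, or the whole array
    // when NEON is not available.
    for (; i < n; ++i)
        tr[i] = t1[i] * t2[i];
}
```

With the question's arrays this would be called as `mul_tab_neon_unrolled(t1, t2, tr, n)`. Since n = 100 is not a multiple of 16, the last four elements go through the scalar tail.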

I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8, and Cortex-A5, set -mcpu and -mtune to cortex-a15, cortex-a8, or cortex-a5 accordingly. gcc has no optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and unroll even more than you usually would), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports them).
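To judge whether a given set of flags or a given kernel variant actually pays off, it helps to time the variants inside one program. The sketch below is one possible way to do that. The names `Kernel`, `mul_tab_scalar`, and `time_kernel` are hypothetical, and the empty `asm volatile` statement is a GCC/Clang extension used here on the assumption that, at -O3, the compiler might otherwise discard repeated calls whose results are never read.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical signature shared by the kernels being compared.
using Kernel = void (*)(const float *, const float *, float *, std::size_t);

// Plain scalar kernel, equivalent to the question's mul_tab_standard
// but with an explicit length parameter (an assumption of this sketch).
void mul_tab_scalar(const float *t1, const float *t2, float *tr,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        tr[i] = t1[i] * t2[i];
}

// Runs `kernel` `reps` times and returns the elapsed wall-clock seconds.
double time_kernel(Kernel kernel, const float *t1, const float *t2,
                   float *tr, std::size_t n, int reps) {
    auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < reps; ++k) {
        kernel(t1, t2, tr, n);
        // Compiler barrier (GCC/Clang extension): tells the optimizer that
        // memory reachable through tr may be read here, so the stores of
        // each repetition cannot be removed as dead.
        asm volatile("" : : "r"(tr) : "memory");
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_kernel(mul_tab_scalar, t1, t2, tr, n, 1000000)` and then the same with a NEON kernel of matching signature gives two numbers that can be compared directly under each flag set.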

05-30 22:54