Problem description
I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken in 32 bits. Together these two places account for more than 20% of my total run time, so I think removing the 64-bit division would speed up the code considerably. NEON has some 64-bit instructions. Can anyone suggest a routine that resolves this bottleneck with a faster implementation?
Alternatively, expressing the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C would also be fine.
If anyone has an idea, could you please help me out?
Recommended answer
I did a lot of fixed-point arithmetic in the past and spent a lot of time researching fast 64/32-bit division myself. If you search for 'ARM division' you will find plenty of good links and discussion on this issue.
The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:
http://www.peter-teichmann.de/adiv2e.html
This assembly code is very old, and your assembler may not understand its syntax. It is nevertheless worth porting to your toolchain. It is the fastest division code for your special case that I have seen so far, and trust me: I've benchmarked them all :-)
The last time I did that (about 5 years ago, for the Cortex-A8), this code was about 10 times faster than what the compiler generated.
This code doesn't use NEON. A NEON port would be interesting, though I'm not sure it would improve performance much.
I found the code ported to GAS (GNU toolchain) assembler syntax. This code is working and tested:
division.S
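    .section ".text"
    .global udiv64
    @ uint32_t udiv64(uint32_t lo, uint32_t hi, uint32_t divisor)
    @ In: r0 = low word, r1 = high word, r2 = divisor. Out: r0 = quotient.
    udiv64:
        adds    r0, r0, r0      @ shift the 64-bit dividend r1:r0 left by one
        adc     r1, r1, r1
    .rept 31
        cmp     r1, r2          @ carry set if r1 >= r2
        subcs   r1, r1, r2      @ if so, subtract the divisor
        adcs    r0, r0, r0      @ shift the quotient bit into r0
        adc     r1, r1, r1      @ and the next dividend bit into r1
    .endr
        cmp     r1, r2
        subcs   r1, r1, r2
        adcs    r0, r0, r0
        bx      lr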
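For readers who want to check a port of this routine, here is a minimal C sketch of the same shift-and-subtract scheme. The name `udiv64_model` is hypothetical and not part of the original answer. It assumes the divisor is nonzero and the high word is smaller than the divisor, so that the quotient fits in 32 bits. One deliberate difference: the model keeps the bit shifted out of the remainder, whereas the assembly appears to drop it, which should only matter for divisors above 2^31.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the assembly routine: divides the 64-bit value
   hi:lo by d and returns the 32-bit quotient.
   Assumption: d != 0 and hi < d, so the quotient fits in 32 bits. */
uint32_t udiv64_model(uint32_t lo, uint32_t hi, uint32_t d)
{
    assert(d != 0 && hi < d);
    uint32_t rem = hi;   /* running remainder, always < d between steps */
    uint32_t quot = 0;
    for (int i = 31; i >= 0; i--) {
        /* Remember the bit that falls off the top of the remainder. */
        uint32_t carry = rem >> 31;
        /* Shift the next dividend bit into the remainder. */
        rem = (rem << 1) | ((lo >> i) & 1u);
        quot <<= 1;
        if (carry || rem >= d) {
            rem -= d;    /* wraps modulo 2^32, which is correct when carry is set */
            quot |= 1u;  /* this quotient bit is 1 */
        }
    }
    return quot;
}
```

Comparing `udiv64_model(lo, hi, d)` against `(uint32_t)((((uint64_t)hi << 32) | lo) / d)` over random inputs is one way to validate both the model and an assembly port.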
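C code:

    #include <stdint.h>

    #ifdef __cplusplus
    extern "C"
    #endif
    uint32_t udiv64(uint32_t lo, uint32_t hi, uint32_t divisor);

    /* Calculate (a << 24) / b using a 64-bit intermediate result. */
    int32_t fixdiv24(int32_t a, int32_t b)
    {
        int sign = (a ^ b) < 0;  /* nonzero if the operands have different signs */
        /* Take magnitudes in unsigned arithmetic so INT32_MIN does not overflow. */
        uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
        uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
        uint32_t lo = ua << 24;  /* low 32 bits of ua << 24 */
        uint32_t hi = ua >> 8;   /* high 32 bits of ua << 24 */
        uint32_t q = udiv64(lo, hi, ub);
        return sign ? (int32_t)(0u - q) : (int32_t)q;
    }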
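When benchmarking or testing `fixdiv24`, it may help to have a plain reference that relies on the compiler's own 64-bit division. The sketch below is an illustration only: `fixdiv24_ref` is a hypothetical name, and it assumes `b` is nonzero and that the quotient fits in 32 bits. C division truncates toward zero, which matches the sign-and-magnitude approach used by `fixdiv24`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference with the same contract as fixdiv24, computed with
   the compiler's 64-bit division (typically the slow path the answer avoids).
   Assumption: b != 0 and the quotient fits in 32 bits. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    assert(b != 0);
    /* Multiply instead of shifting so that negative values of a are well defined. */
    int64_t wide = (int64_t)a * (INT64_C(1) << 24);
    return (int32_t)(wide / b);
}
```

For example, dividing 1.0 by 2.0 in 8.24 fixed point, `fixdiv24_ref(1 << 24, 2 << 24)`, gives `1 << 23`, which represents 0.5.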