How to OR all lanes of a NEON vector
Problem description
I want to use NEON intrinsics to optimize the following code.
uint32x4_t c1; // 4 elements, each element is 0 or 1
uint32x4_t c2; // 4 elements, each element is 0 or 1
uint8_t pack = 0; // unsigned char, for result
/* some code */
// need optimizing
pack |= vgetq_lane_u32(c1, 0);
pack |= vgetq_lane_u32(c1, 1) << 1;
pack |= vgetq_lane_u32(c1, 2) << 2;
pack |= vgetq_lane_u32(c1, 3) << 3;
pack |= vgetq_lane_u32(c2, 0) << 4;
pack |= vgetq_lane_u32(c2, 1) << 5;
pack |= vgetq_lane_u32(c2, 2) << 6;
pack |= vgetq_lane_u32(c2, 3) << 7;
I think I need some intrinsics to OR all lanes of a vector. Could anybody give me some hints?
Recommended answer
You can shift each element of a vector by its own number of bits.
const int32x4_t shifter1 = {0, 1, 2, 3};
const int32x4_t shifter2 = {4, 5, 6, 7};
.
.
.
c1 = vshlq_u32(c1, shifter1);
c2 = vshlq_u32(c2, shifter2);
c1 = vorrq_u32(c1, c2);
pack |= vgetq_lane_u32(c1, 0) | vgetq_lane_u32(c1, 1) | vgetq_lane_u32(c1, 2) | vgetq_lane_u32(c1, 3);
That should do the trick. How efficiently the last line (the OR across the four lanes) gets compiled depends on the quality of your compiler.
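If you would rather not leave the final lane-by-lane OR to the compiler, the reduction can be written out explicitly. The following is only a sketch under a few assumptions: the helper name pack_flags and the array-based interface are hypothetical and not part of the original answer, and every input element is 0 or 1 as stated in the question. On ARMv7 it folds the four lanes to two with vorr_u32 before extracting them. On AArch64 it uses vaddvq_u32, which is valid here only because the shifted lanes occupy disjoint bits, so adding them gives the same result as ORing them. A scalar fallback is included so the code also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original answer): packs eight 0/1
   flags into one byte. c1 supplies bits 0..3, c2 supplies bits 4..7. */
static uint8_t pack_flags(const uint32_t c1[4], const uint32_t c2[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    static const int32_t shift_lo[4] = {0, 1, 2, 3};
    static const int32_t shift_hi[4] = {4, 5, 6, 7};

    /* Shift every lane by its own bit position, then merge both vectors. */
    uint32x4_t lo = vshlq_u32(vld1q_u32(c1), vld1q_s32(shift_lo));
    uint32x4_t hi = vshlq_u32(vld1q_u32(c2), vld1q_s32(shift_hi));
    uint32x4_t v  = vorrq_u32(lo, hi);

#if defined(__aarch64__)
    /* AArch64 only: the lanes hold disjoint bits, so a horizontal add
       produces the same value as a horizontal OR. */
    return (uint8_t)vaddvq_u32(v);
#else
    /* ARMv7: fold 4 lanes to 2 with one OR, then combine the last two. */
    uint32x2_t half = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return (uint8_t)(vget_lane_u32(half, 0) | vget_lane_u32(half, 1));
#endif
#else
    /* Scalar fallback so the sketch also builds on non-NEON targets. */
    uint8_t pack = 0;
    for (int i = 0; i < 4; ++i) {
        pack |= (uint8_t)(c1[i] << i);
        pack |= (uint8_t)(c2[i] << (i + 4));
    }
    return pack;
#endif
}
```

Whether this beats the four vgetq_lane_u32 calls depends on the target core and the compiler, so it is worth measuring before committing to either form.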