Problem description
I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I have tried several things, but most examples online use an array of floats, while our data structure is different.
We use structs that are 16-byte aligned. Code excerpt (simplified):
struct float3 {
float x, y, z, w; // 4th component unused here
};
struct float4 {
float x, y, z, w;
};
In previous tests (using the SSE4 dot product intrinsic or FMA) I could not get a speedup compared to the following regular C++ code.
float dot(const float3 a, const float3 b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
Tests were done with gcc and clang on Intel Ivy Bridge / Haswell. It seems that the time spent loading the data into the SIMD registers and pulling it out again kills all the benefits.
I would appreciate some help and ideas on how the dot product can be calculated efficiently using our float3/float4 data structures. SSE4, AVX or even AVX2 is fine.
Thanks in advance.
Recommended answer
Algebraically, efficient SIMD looks almost identical to scalar code. So the right way to do the dot product is to operate on four float vectors at once with SSE (eight with AVX).
Consider structuring your code like this:
#include <x86intrin.h>
struct float4 {
__m128 xmm;
float4 () {};
float4 (__m128 const & x) { xmm = x; }
float4 & operator = (__m128 const & x) { xmm = x; return *this; }
float4 & load(float const * p) { xmm = _mm_loadu_ps(p); return *this; }
operator __m128() const { return xmm; }
};
static inline float4 operator + (float4 const & a, float4 const & b) {
return _mm_add_ps(a, b);
}
static inline float4 operator * (float4 const & a, float4 const & b) {
return _mm_mul_ps(a, b);
}
struct block3 {
float4 x, y, z;
};
struct block4 {
float4 x, y, z, w;
};
static inline float4 dot(block3 const & a, block3 const & b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
static inline float4 dot(block4 const & a, block4 const & b) {
return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}
Notice that the last two functions look almost identical to your scalar dot function, except that float becomes float4 and float4 becomes block3 or block4. Each call computes four dot products using only element-wise multiplies and adds, with no horizontal operation per vector, which is what makes this the most efficient way to do the dot product.
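If the data has to stay in the question's 16-byte aligned float3 layout, one possible way to feed it into the blocked form is to transpose four vectors at a time. The following is only a sketch under assumptions: load4, dot4 and dot4_aos are hypothetical helper names that appear neither in the question nor in the answer, block3 is reduced to raw __m128 members to keep the example self-contained, and the unused w component is assumed to be initialized because it is loaded along with x, y and z. It uses only SSE1 intrinsics.

```cpp
#include <xmmintrin.h>  // SSE1: __m128, _mm_load_ps, _MM_TRANSPOSE4_PS

// The question's layout: 16-byte aligned, 4th component unused.
struct alignas(16) float3 { float x, y, z, w; };

// Same role as block3 in the answer: four x's, four y's, four z's.
struct block3 { __m128 x, y, z; };

// Hypothetical helper: transpose four consecutive float3 values
// (array of structs) into one block3 (struct of arrays).
static inline block3 load4(const float3* v) {
    __m128 r0 = _mm_load_ps(&v[0].x);  // x0 y0 z0 w0
    __m128 r1 = _mm_load_ps(&v[1].x);  // x1 y1 z1 w1
    __m128 r2 = _mm_load_ps(&v[2].x);  // x2 y2 z2 w2
    __m128 r3 = _mm_load_ps(&v[3].x);  // x3 y3 z3 w3
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3); // r0 = x0..x3, r1 = y0..y3, r2 = z0..z3
    return block3{r0, r1, r2};         // r3 (the w's) is dropped
}

// Four dot products at once: element-wise multiplies and adds only.
static inline __m128 dot4(const block3& a, const block3& b) {
    __m128 xy = _mm_add_ps(_mm_mul_ps(a.x, b.x), _mm_mul_ps(a.y, b.y));
    return _mm_add_ps(xy, _mm_mul_ps(a.z, b.z));
}

// Hypothetical wrapper: out[i] = dot(a[i], b[i]) for i = 0..3.
static inline void dot4_aos(const float3* a, const float3* b, float* out) {
    _mm_storeu_ps(out, dot4(load4(a), load4(b)));
}
```

Calling dot4_aos(a, b, out) with two arrays of four float3 values writes the four results to out. The transpose adds shuffle work per batch, so whether this beats the scalar version depends on the surrounding code; storing the data in block form from the start avoids that cost.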