Problem Description
I have two unsigned vectors, both of size 4:
vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};
Now I want to multiply these two vectors element-wise and get a new one:
vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};
What SSE operation should I use? Is it cross-platform, or only available on certain platforms?
Addendum:
If my goal were addition rather than multiplication, I could do it super fast:
__m128i a = _mm_set_epi32(1, 2, 3, 4);
__m128i b = _mm_set_epi32(1, 2, 3, 4);
__m128i c;
c = _mm_add_epi32(a, b);
Solution
Using the set intrinsics such as _mm_set_epi32 for all elements is inefficient. It's better to use the load intrinsics; see the discussion "Where does the SSE instructions outperform normal instructions" for more on that. If the arrays are 16-byte aligned you can use either _mm_load_si128 or _mm_loadu_si128 (for aligned memory they have nearly the same efficiency); otherwise use _mm_loadu_si128. But aligned memory is much more efficient. To get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so that you can use normal free.
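A minimal sketch of that loading step, assuming 16-byte-aligned buffers obtained with _mm_malloc (the buffer names p1/p2 and the sample values are illustrative, not part of the original answer):

#include <immintrin.h>

int main() {
    // 16-byte-aligned storage for four 32-bit unsigned values each
    unsigned *p1 = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
    unsigned *p2 = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
    p1[0] = 2; p1[1] = 4;  p1[2] = 6;  p1[3] = 8;
    p2[0] = 1; p2[1] = 10; p2[2] = 11; p2[3] = 13;

    // Aligned loads are valid because the buffers came from _mm_malloc(..., 16);
    // for unaligned data, _mm_loadu_si128 would be used instead
    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(p1));
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(p2));

    (void)a; (void)b;   // the multiply itself is covered below
    _mm_free(p1);
    _mm_free(p2);
    return 0;
}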
To answer the rest of the question, let's assume your two vectors are loaded in the SSE registers __m128i a and __m128i b.
For SSE versions >= SSE4.1, use:
_mm_mullo_epi32(a, b);
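Putting the pieces together, a sketch of the whole element-wise multiply with SSE4.1 could look like this (it loads straight from std::vector storage with unaligned loads, which is an assumption made for illustration rather than code from the answer; compile with SSE4.1 enabled, e.g. -msse4.1):

#include <immintrin.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<unsigned> v1 = {2, 4, 6, 8};
    std::vector<unsigned> v2 = {1, 10, 11, 13};
    std::vector<unsigned> v_result(4);

    // std::vector storage is not guaranteed to be 16-byte aligned, so use unaligned load/store
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v1.data()));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v2.data()));
    __m128i c = _mm_mullo_epi32(a, b);                       // requires SSE4.1
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v_result.data()), c);

    for (unsigned x : v_result) std::printf("%u ", x);       // prints: 2 40 66 104
    std::printf("\n");
    return 0;
}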
Without SSE4.1:
This code is copied from Agner Fog's Vector Class Library (and was plagiarized by the original author of this answer):
// From the VCL's Vec4i operator * (Vec4i const & a, Vec4i const & b), SSE2 path
__m128i a13    = _mm_shuffle_epi32(a, 0xF5);           // (-, a3, -, a1)
__m128i b13    = _mm_shuffle_epi32(b, 0xF5);           // (-, b3, -, b1)
__m128i prod02 = _mm_mul_epu32(a, b);                  // (-, a2*b2, -, a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);              // (-, a3*b3, -, a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);   // (-, -, a1*b1, a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);   // (-, -, a3*b3, a2*b2)
__m128i prod   = _mm_unpacklo_epi64(prod01, prod23);   // (ab3, ab2, ab1, ab0)
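One way to package the two paths is a small compile-time dispatch (a sketch; the helper name mul_epi32_sse is made up here, and __SSE4_1__ is the macro GCC/Clang define when SSE4.1 code generation is enabled):

#include <immintrin.h>

// Hypothetical helper: element-wise 32-bit multiply of two SSE registers.
// Uses the single SSE4.1 instruction when available, otherwise the SSE2
// shuffle/mul/unpack sequence shown above.
static inline __m128i mul_epi32_sse(__m128i a, __m128i b) {
#ifdef __SSE4_1__
    return _mm_mullo_epi32(a, b);
#else
    __m128i a13    = _mm_shuffle_epi32(a, 0xF5);
    __m128i b13    = _mm_shuffle_epi32(b, 0xF5);
    __m128i prod02 = _mm_mul_epu32(a, b);
    __m128i prod13 = _mm_mul_epu32(a13, b13);
    __m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);
    __m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);
    return _mm_unpacklo_epi64(prod01, prod23);
#endif
}

Agner Fog's Vec4i wrapper in the Vector Class Library performs essentially this selection, just expressed through its operator* overload.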