Integer dot product using SSE / AVX?

Problem description

I am looking at the Intel Intrinsics Guide:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

and whilst they have _mm_dp_ps and _mm_dp_pd for calculating the dot product of floats and doubles, I cannot see anything for calculating an integer dot product.

I have two unsigned int[8] arrays and I would like to compute:

(a[0] * b[0]) + (a[1] * b[1]) + ... + (a[num_elements_in_array-1] * b[num_elements_in_array-1])

that is, multiply the elements (in batches of four) and sum the products. How can I do this?

Solution

Every time someone does this:

temp_1 = _mm_set_epi32(x[j], x[j+1], x[j+2], x[j+3]);

.. a puppy dies.

Use one of these:

temp_1 = _mm_load_si128(x);  // if aligned
temp_1 = _mm_loadu_si128(x); // if not aligned

Cast x to (const __m128i *) as necessary.
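To make the two load forms and the cast concrete, here is a minimal sketch. The helper names (load4_aligned, load4_unaligned, lane0) are hypothetical and exist only for illustration; the intrinsics themselves are plain SSE2.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_loadu_si128 */
#include <stdint.h>

/* Hypothetical helper: load 4 consecutive 32-bit values from an address
   that is known to be 16-byte aligned. */
static __m128i load4_aligned(const uint32_t *p)
{
    return _mm_load_si128((const __m128i *)p);
}

/* Hypothetical helper: load 4 consecutive 32-bit values from any address. */
static __m128i load4_unaligned(const uint32_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Hypothetical helper: return lane 0 (the lowest 32 bits) of a vector. */
static uint32_t lane0(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

With either load, the element at the lowest address ends up in lane 0. Note that _mm_set_epi32 takes its arguments highest lane first, so the call shown above would also place x[j] in the top lane rather than the bottom one.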

There is no integer version of _mm_dp_ps. But you can do what you were about to do: multiply the integers four at a time and accumulate the sum of the products. (The multiply used below, _mm_mullo_epi32, requires SSE4.1.)

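So something like this (a sketch: it assumes x and y point to 16-byte-aligned unsigned int data and that num_elements_in_array is a multiple of 4)

__m128i temp_sum = _mm_setzero_si128();
size_t j = 0;
while (j < num_elements_in_array) {
    // Load 4 values from x (use _mm_loadu_si128 if x is not aligned)
    __m128i temp_1 = _mm_load_si128((const __m128i *)(x + j));
    // Load 4 values from y
    __m128i temp_2 = _mm_load_si128((const __m128i *)(y + j));
    j += 4;
    // Multiply x[0] and y[0], x[1] and y[1] etc. (keeps the low 32 bits of each product)
    __m128i temp_products = _mm_mullo_epi32(temp_1, temp_2);
    // Accumulate the products lane by lane
    temp_sum = _mm_add_epi32(temp_sum, temp_products);
}
// Take the horizontal sum of temp_sum
temp_sum = _mm_add_epi32(temp_sum, _mm_srli_si128(temp_sum, 8));
temp_sum = _mm_add_epi32(temp_sum, _mm_srli_si128(temp_sum, 4));
unsigned int sum = (unsigned int)_mm_cvtsi128_si32(temp_sum);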

As discussed in the comments and chat, that reorders the sums in such a way as to minimize the number of horizontal sums required, by doing most sums vertically.
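For reference, the same idea can be wrapped into a self-contained function. This is only a sketch under stated assumptions: the name dot_u32 is hypothetical, unaligned loads are used so the caller need not guarantee alignment, a scalar tail handles lengths that are not a multiple of 4, and the target attribute is a GCC/Clang extension (with other compilers, remove it and enable SSE4.1 through the compiler's own options). The result wraps modulo 2^32, just as plain unsigned int arithmetic would.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 (also pulls in SSE2) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original answer.
   The target attribute (GCC/Clang only) enables SSE4.1 for this function;
   alternatively drop it and compile with -msse4.1. */
__attribute__((target("sse4.1")))
static uint32_t dot_u32(const uint32_t *x, const uint32_t *y, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    size_t j = 0;

    /* Vertical part: 4 products per iteration, added lane by lane. */
    for (; j + 4 <= n; j += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(x + j));
        __m128i b = _mm_loadu_si128((const __m128i *)(y + j));
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(a, b));
    }

    /* Horizontal part, done once: fold the 4 lanes into lane 0. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(acc);

    /* Scalar tail for the last n % 4 elements. */
    for (; j < n; ++j)
        sum += x[j] * y[j];

    return sum;
}
```

For the unsigned int[8] arrays in the question, the call would be dot_u32(a, b, 8): two loop iterations followed by a single horizontal reduction. If the individual products or their sum can exceed 32 bits, this version silently truncates, so a widening multiply would be needed instead.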
