Problem description
I am looking at the Intel Intrinsics Guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
and whilst they have _mm_dp_ps and _mm_dp_pd for calculating the dot product of floats and doubles, I cannot see anything for calculating an integer dot product.
I have two unsigned int[8] arrays, a and b, and I would like to compute:
(a[0] * b[0]) + (a[1] * b[1]) + ... + (a[num_elements_in_array-1] * b[num_elements_in_array-1])
How can I multiply the elements in batches of four and sum the products?
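For reference, the scalar computation described above could be sketched as follows. This is only a baseline for checking a vectorized version; the function name dot_u32_scalar is hypothetical, and uint32_t is used on the assumption that unsigned int is 32 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical reference function: plain scalar dot product.
// Each product and the running sum wrap modulo 2^32, as 32-bit
// unsigned arithmetic does in C.
uint32_t dot_u32_scalar(const uint32_t *a, const uint32_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Calling dot_u32_scalar(a, b, 8) on the two eight-element arrays gives the value a SIMD version should reproduce.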
Every time someone does this:
temp_1 = _mm_set_epi32(x[j], x[j+1], x[j+2], x[j+3]);
.. a puppy dies.
Use one of these:
temp_1 = _mm_load_si128(x); // if aligned
temp_1 = _mm_loadu_si128(x); // if not aligned
Cast x as necessary (the load intrinsics take a const __m128i *).
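A minimal sketch of the load-with-cast pattern, assuming the data is an array of 32-bit unsigned values. The helper names load4_u32 and lane_u32 are hypothetical and exist only for illustration. Besides being slower, _mm_set_epi32 places its first argument in the highest lane, so the form shown above would also reverse the element order relative to a load.

```c
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_storeu_si128
#include <stdint.h>

// Hypothetical helper: load four consecutive 32-bit values starting at p.
// _mm_loadu_si128 has no alignment requirement; if p is known to be
// 16-byte aligned, _mm_load_si128 can be used instead.
static inline __m128i load4_u32(const uint32_t *p) {
    return _mm_loadu_si128((const __m128i *)p);
}

// Hypothetical helper: read lane i (0..3) of v by storing it to memory.
// Lane 0 corresponds to the lowest address, i.e. p[0] after a load.
static inline uint32_t lane_u32(__m128i v, int i) {
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

With x = {10, 20, 30, 40}, lane 0 of load4_u32(x) is x[0], whereas lane 0 of _mm_set_epi32(x[0], x[1], x[2], x[3]) is x[3].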
There is no integer version of _mm_dp_ps. But you can do what you were about to do: multiply four integers by four integers at a time and accumulate the sum of the products.
So something like this (not tested; assumes num_elements_in_array is a multiple of 4 and that x and y are 16-byte aligned):
__m128i temp_sum = _mm_setzero_si128();
for (size_t j = 0; j < num_elements_in_array; j += 4) {
    // Load 4 values from x
    __m128i temp_1 = _mm_load_si128((const __m128i *)(x + j));
    // Load 4 values from y
    __m128i temp_2 = _mm_load_si128((const __m128i *)(y + j));
    // Multiply x[j] by y[j], x[j+1] by y[j+1], etc. (requires SSE4.1)
    __m128i temp_products = _mm_mullo_epi32(temp_1, temp_2);
    // Accumulate the products lane by lane
    temp_sum = _mm_add_epi32(temp_sum, temp_products);
}
// Take the horizontal sum of temp_sum
temp_sum = _mm_add_epi32(temp_sum, _mm_srli_si128(temp_sum, 8));
temp_sum = _mm_add_epi32(temp_sum, _mm_srli_si128(temp_sum, 4));
unsigned int sum = (unsigned int)_mm_cvtsi128_si32(temp_sum);
As discussed in the comments and chat, this reorders the additions so that most of them are done vertically (lane by lane) inside the loop, which minimizes the number of horizontal sums required: only one is needed, at the very end.
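Putting the pieces together, the approach could be wrapped in a function along these lines. This is a sketch under stated assumptions rather than a tuned implementation: the name dot_u32_sse41 is hypothetical, n is assumed to be a multiple of 4, unaligned loads are used so that no alignment is required, and the target attribute is a GCC/Clang feature that lets the function be built without passing -msse4.1 for the whole file. Because _mm_mullo_epi32 keeps only the low 32 bits of each product, the result wraps modulo 2^32 just as scalar unsigned int arithmetic would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32

// Hypothetical function: 32-bit integer dot product, four lanes at a time.
// Assumes n is a multiple of 4 and the CPU supports SSE4.1.
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("sse4.1")))
#endif
uint32_t dot_u32_sse41(const uint32_t *a, const uint32_t *b, size_t n) {
    assert(n % 4 == 0);
    __m128i acc = _mm_setzero_si128();  // four partial sums, one per lane
    for (size_t j = 0; j < n; j += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + j));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + j));
        // Vertical step: lane-wise multiply (low 32 bits), then lane-wise add.
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(va, vb));
    }
    // Single horizontal step: fold lanes 2,3 onto 0,1, then lane 1 onto 0.
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return (uint32_t)_mm_cvtsi128_si32(acc);
}
```

For the two unsigned int[8] arrays in the question, dot_u32_sse41(a, b, 8) runs the loop twice and performs the horizontal sum once.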