Problem Description
I have two unsigned vectors, both of size 4:
vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};
Now I want to multiply these two vectors element-wise and get a new one:
vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};
What SSE operation should I use? Is it cross-platform, or only available on certain platforms?
Addendum:
If my goal were addition rather than multiplication, I could do it super fast:
__m128i a = _mm_set_epi32(1, 2, 3, 4);
__m128i b = _mm_set_epi32(1, 2, 3, 4);
__m128i c;
c = _mm_add_epi32(a, b);
Solution
Using the set intrinsics such as _mm_set_epi32 for all elements is inefficient. It's better to use the load intrinsics; see the discussion "Where does the SSE instructions outperform normal instructions" for more on that. If the arrays are 16-byte aligned you can use either _mm_load_si128 or _mm_loadu_si128 (for aligned memory they have nearly the same efficiency); otherwise use _mm_loadu_si128. But aligned memory is much more efficient. To get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so that you can use normal free.
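A minimal sketch of that loading step, assuming 16-byte-aligned buffers obtained with _mm_malloc (the buffer names p1/p2 and the sample values are illustrative, not part of the original answer):

#include <immintrin.h>

int main() {
    // 16-byte-aligned storage for four 32-bit unsigned values each
    unsigned *p1 = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
    unsigned *p2 = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
    p1[0] = 2; p1[1] = 4;  p1[2] = 6;  p1[3] = 8;
    p2[0] = 1; p2[1] = 10; p2[2] = 11; p2[3] = 13;

    // Aligned loads are valid because the buffers came from _mm_malloc(..., 16);
    // for unaligned data, _mm_loadu_si128 would be used instead
    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(p1));
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(p2));

    (void)a; (void)b;   // the multiply itself is covered below
    _mm_free(p1);
    _mm_free(p2);
    return 0;
}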
To answer the rest of the question, let's assume your two vectors are loaded in the SSE registers __m128i a and __m128i b.
For SSE versions >= SSE4.1, use:
_mm_mullo_epi32(a, b);
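Putting the pieces together, a sketch of the whole element-wise multiply with SSE4.1 could look like this (it loads straight from std::vector storage with unaligned loads, which is an assumption made for illustration rather than code from the answer; compile with SSE4.1 enabled, e.g. -msse4.1):

#include <immintrin.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<unsigned> v1 = {2, 4, 6, 8};
    std::vector<unsigned> v2 = {1, 10, 11, 13};
    std::vector<unsigned> v_result(4);

    // std::vector storage is not guaranteed to be 16-byte aligned, so use unaligned load/store
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v1.data()));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v2.data()));
    __m128i c = _mm_mullo_epi32(a, b);                       // requires SSE4.1
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v_result.data()), c);

    for (unsigned x : v_result) std::printf("%u ", x);       // prints: 2 40 66 104
    std::printf("\n");
    return 0;
}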
Without SSE4.1:
This code is copied from Agner Fog's Vector Class Library (and was plagiarized by the original author of this answer):
// From the VCL's Vec4i operator * (Vec4i const & a, Vec4i const & b), SSE2 path
__m128i a13    = _mm_shuffle_epi32(a, 0xF5);           // (-, a3, -, a1)
__m128i b13    = _mm_shuffle_epi32(b, 0xF5);           // (-, b3, -, b1)
__m128i prod02 = _mm_mul_epu32(a, b);                  // (-, a2*b2, -, a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);              // (-, a3*b3, -, a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);   // (-, -, a1*b1, a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);   // (-, -, a3*b3, a2*b2)
__m128i prod   = _mm_unpacklo_epi64(prod01, prod23);   // (ab3, ab2, ab1, ab0)
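One way to package the two paths is a small compile-time dispatch (a sketch; the helper name mul_epi32_sse is made up here, and __SSE4_1__ is the macro GCC/Clang define when SSE4.1 code generation is enabled):

#include <immintrin.h>

// Hypothetical helper: element-wise 32-bit multiply of two SSE registers.
// Uses the single SSE4.1 instruction when available, otherwise the SSE2
// shuffle/mul/unpack sequence shown above.
static inline __m128i mul_epi32_sse(__m128i a, __m128i b) {
#ifdef __SSE4_1__
    return _mm_mullo_epi32(a, b);
#else
    __m128i a13    = _mm_shuffle_epi32(a, 0xF5);
    __m128i b13    = _mm_shuffle_epi32(b, 0xF5);
    __m128i prod02 = _mm_mul_epu32(a, b);
    __m128i prod13 = _mm_mul_epu32(a13, b13);
    __m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);
    __m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);
    return _mm_unpacklo_epi64(prod01, prod23);
#endif
}

Agner Fog's Vec4i wrapper in the Vector Class Library performs essentially this selection, just expressed through its operator* overload.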