Problem description
My SSE2 code is long and slow. How can I make it faster? Also, _mm_store_si128() fails but _mm_storeu_si128() is accepted. Why?
void tom::add(void* ptr)
{
    __declspec(align(16))short* b =(short*)ptr;
    int j;
#if cplusplus
    for(j = 0; j < 4; j++)
    {
        /// 1st stage transform.
        int x0 = (int)(b[j] + b[j+12]);
        int x3 = (int)(b[j] - b[j+12]);
        int x1 = (int)(b[j+4] + b[j+8]);
        int x2 = (int)(b[j+4] - b[j+8]);
        /// 2nd stage transform.
        b[j] = (short)(x0 + x1);
        b[j+8] = (short)(x0 - x1);
        b[j+4] = (short)(x2 + (x3 << 1));
        b[j+12] = (short)(x3 - (x2 << 1));
    }//end for j...
#else
    __m128i f0,f1,f2,f3;
    j=0;
    f0 = _mm_set_epi32(b[j+3],b[j+2],b[j+1],b[j]);
    f1 = _mm_set_epi32(b[j+7],b[j+6],b[j+5],b[j+4]);
    f2 = _mm_set_epi32(b[j+11],b[j+10],b[j+9],b[j+8]);
    f3 = _mm_set_epi32(b[j+15],b[j+14],b[j+13],b[j+12]);
    __declspec(align(16)) __m128i*b = (__m128i*)ptr;
    __m128i temp0,temp1,temp2,temp3,temp4;
    temp0 = f0;
    temp1 = f1;
    temp2 = f2;
    temp3 = f3;
    temp0 = _mm_add_epi16(temp0, f3);
    temp1 = _mm_add_epi16(temp1, f2);
    f0 = _mm_sub_epi16(f0, f3);
    f1 = _mm_sub_epi16(f1, f2);
    temp4 = temp0;
    temp4 = _mm_add_epi16(temp4, temp1);
    _mm_storeu_si128(b, temp4);
    temp0 = _mm_sub_epi16(temp0, temp1);
    _mm_storeu_si128(b+2, temp0);
    temp1 = f0;
    temp4 = f1;
    temp1 = _mm_slli_epi16(temp1, 1);
    temp4 = _mm_slli_epi16(temp4, 1);
    f0 = _mm_add_epi16(f0, temp4);
    f1 = _mm_sub_epi16(f1, temp1);
    _mm_storeu_si128(b+1, f0);
    _mm_storeu_si128(b+3, f1);
#endif
}
Recommended answer
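What follows is a hedged sketch of one possible approach, not a definitive implementation. It assumes that ptr refers to 16 contiguous 16-bit values (a 4x4 block), as the scalar branch of the question implies. The function names transform4x4_scalar and transform4x4_sse2 are hypothetical. Two observations about the posted code motivate it. First, _mm_set_epi32 widens each short into a 32-bit lane, so the later _mm_*_epi16 operations and the 128-bit stores no longer match the 16-bit layout of the buffer. Second, __declspec(align(16)) on a pointer variable aligns the pointer itself, not the data it points to, so _mm_store_si128 can fault whenever the buffer is not 16-byte aligned, while _mm_storeu_si128 has no alignment requirement. Because each row is only four shorts (64 bits), the sketch loads one row per register with _mm_loadl_epi64 and processes all four columns at once, which removes the loop over j.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Scalar reference, equivalent to the scalar branch in the question.
// Multiplication by 2 replaces "<< 1" to avoid shifting negative ints.
void transform4x4_scalar(std::int16_t* b)
{
    for (int j = 0; j < 4; ++j) {
        // 1st stage transform.
        int x0 = b[j] + b[j + 12];
        int x3 = b[j] - b[j + 12];
        int x1 = b[j + 4] + b[j + 8];
        int x2 = b[j + 4] - b[j + 8];
        // 2nd stage transform.
        b[j]      = static_cast<std::int16_t>(x0 + x1);
        b[j + 8]  = static_cast<std::int16_t>(x0 - x1);
        b[j + 4]  = static_cast<std::int16_t>(x2 + 2 * x3);
        b[j + 12] = static_cast<std::int16_t>(x3 - 2 * x2);
    }
}

// SSE2 sketch (hypothetical name). Assumes b points to 16 int16_t values.
// No alignment is required: the 64-bit load/store intrinsics used here
// accept unaligned addresses.
void transform4x4_sse2(std::int16_t* b)
{
    // Load each row (four 16-bit values) into the low 64 bits of a register.
    __m128i r0 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    __m128i r1 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 4));
    __m128i r2 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 8));
    __m128i r3 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 12));

    // 1st stage transform, all four columns at once.
    __m128i x0 = _mm_add_epi16(r0, r3);
    __m128i x3 = _mm_sub_epi16(r0, r3);
    __m128i x1 = _mm_add_epi16(r1, r2);
    __m128i x2 = _mm_sub_epi16(r1, r2);

    // 2nd stage transform. 16-bit lanes wrap around, which matches the
    // truncation to short in the scalar version.
    __m128i y0 = _mm_add_epi16(x0, x1);
    __m128i y2 = _mm_sub_epi16(x0, x1);
    __m128i y1 = _mm_add_epi16(x2, _mm_slli_epi16(x3, 1));
    __m128i y3 = _mm_sub_epi16(x3, _mm_slli_epi16(x2, 1));

    // Store the low 64 bits of each result back to its row.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b), y0);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 4), y1);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 8), y2);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 12), y3);
}
```

If the caller can guarantee a 16-byte aligned buffer, for example by declaring it as alignas(16) std::int16_t block[16], the aligned _mm_load_si128 and _mm_store_si128 could be used on two rows at a time instead, at the cost of a few extra shuffles to pair up the rows.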