Loop unrolling using SSE2 intrinsics

Problem description

My SSE2 code is long and slow. How can I make it faster? Also, _mm_store_si128() fails but _mm_storeu_si128() is accepted. Why?

void tom::add(void* ptr)
{
    __declspec(align(16)) short* b = (short*)ptr;
    int j;
#if cplusplus
    for (j = 0; j < 4; j++)
    {
        /// 1st stage transform.
        int x0 = (int)(b[j]   + b[j+12]);
        int x3 = (int)(b[j]   - b[j+12]);
        int x1 = (int)(b[j+4] + b[j+8]);
        int x2 = (int)(b[j+4] - b[j+8]);
        /// 2nd stage transform.

        b[j]    = (short)(x0 + x1);
        b[j+8]  = (short)(x0 - x1);
        b[j+4]  = (short)(x2 + (x3 << 1));
        b[j+12] = (short)(x3 - (x2 << 1));
    } //end for j...
#else

    __m128i f0, f1, f2, f3;

    j = 0;
    f0 = _mm_set_epi32(b[j+3], b[j+2], b[j+1], b[j]);
    f1 = _mm_set_epi32(b[j+7], b[j+6], b[j+5], b[j+4]);
    f2 = _mm_set_epi32(b[j+11], b[j+10], b[j+9], b[j+8]);
    f3 = _mm_set_epi32(b[j+15], b[j+14], b[j+13], b[j+12]);
    __declspec(align(16)) __m128i* b = (__m128i*)ptr;
    __m128i temp0, temp1, temp2, temp3, temp4;
    temp0 = f0;
    temp1 = f1;
    temp2 = f2;
    temp3 = f3;
    temp0 = _mm_add_epi16(temp0, f3);
    temp1 = _mm_add_epi16(temp1, f2);
    f0 = _mm_sub_epi16(f0, f3);
    f1 = _mm_sub_epi16(f1, f2);
    temp4 = temp0;
    temp4 = _mm_add_epi16(temp4, temp1);
    _mm_storeu_si128(b, temp4);
    temp0 = _mm_sub_epi16(temp0, temp1);
    _mm_storeu_si128(b+2, temp0);
    temp1 = f0;
    temp4 = f1;
    temp1 = _mm_slli_epi16(temp1, 1);
    temp4 = _mm_slli_epi16(temp4, 1);
    f0 = _mm_add_epi16(f0, temp4);
    f1 = _mm_sub_epi16(f1, temp1);
    _mm_storeu_si128(b+1, f0);
    _mm_storeu_si128(b+3, f1);
#endif
}

Recommended answer
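
What follows is a sketch of one possible approach, not a definitive fix. Two points are certain. First, _mm_store_si128() requires its destination address to be 16-byte aligned, while _mm_storeu_si128() accepts any address. Second, __declspec(align(16)) on a pointer declaration aligns the pointer variable itself, not the memory it points to. So if ptr is not 16-byte aligned, the aligned store faults regardless of the attribute.

The intrinsic branch in the question also appears to differ from the scalar loop. _mm_set_epi32 widens each short into a 32-bit lane, so the four 128-bit stores write 64 bytes into a 32-byte block. The last two outputs compute x3 + 2*x2 and x2 - 2*x3 rather than x2 + 2*x3 and x3 - 2*x2.

The version below assumes the block is 16 contiguous short values laid out as four rows of four, as in the scalar loop, and that short is 16 bits wide. The name transform4x4_sse2 is hypothetical. It uses 64-bit loads and stores, which have no alignment requirement, and 16-bit lane arithmetic, which wraps the same way as the (short) casts in the scalar code.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: in-place equivalent of the scalar loop in the question.
// Assumes b points to 16 shorts (4 rows x 4 columns); no alignment is required.
void transform4x4_sse2(short* b)
{
    // Load each row (4 shorts = 64 bits) into the low half of a register.
    __m128i r0 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    __m128i r1 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 4));
    __m128i r2 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 8));
    __m128i r3 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b + 12));

    // 1st stage: all four columns (the j loop) are processed at once.
    __m128i x0 = _mm_add_epi16(r0, r3);
    __m128i x3 = _mm_sub_epi16(r0, r3);
    __m128i x1 = _mm_add_epi16(r1, r2);
    __m128i x2 = _mm_sub_epi16(r1, r2);

    // 2nd stage.
    __m128i o0 = _mm_add_epi16(x0, x1);                     // b[j]
    __m128i o2 = _mm_sub_epi16(x0, x1);                     // b[j+8]
    __m128i o1 = _mm_add_epi16(x2, _mm_slli_epi16(x3, 1));  // b[j+4]
    __m128i o3 = _mm_sub_epi16(x3, _mm_slli_epi16(x2, 1));  // b[j+12]

    // Store only the low 64 bits of each result, so exactly 32 bytes are written.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b),      o0);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 4),  o1);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 8),  o2);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(b + 12), o3);
}
```

If the caller can guarantee alignment, for example by declaring the buffer as alignas(16) short block[16];, then full-width _mm_load_si128() and _mm_store_si128() on that buffer become safe. Whether that is faster than the version above would need to be measured.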



05-31 00:37