__m256d TRANSPOSE4 equivalent?
Question
Intel provides _MM_TRANSPOSE4_PS to transpose a 4x4 matrix of floats held in four __m128 vectors. I want to do the equivalent with __m256d, i.e. a 4x4 matrix of doubles. However, I can't figure out how to use _mm256_shuffle_pd to get the same result.
_MM_TRANSPOSE4_PS code
#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) { \
    __m128 tmp3, tmp2, tmp1, tmp0; \
    \
    tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \
    tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \
    tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \
    tmp3 = _mm_shuffle_ps((row2), (row3), 0xEE); \
    \
    (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88); \
    (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD); \
    (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88); \
    (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD); \
}
My attempt at a _MM_TRANSPOSE4_PD, inside the loop where I need it
for (int copy = i; copy < m2.size();)
{
    __m256d row0 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d row1 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d row2 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d row3 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d tmp3, tmp2, tmp1, tmp0;
    tmp0 = _mm256_shuffle_pd(row0,row1, 0x44);
    tmp2 = _mm256_shuffle_pd(row0,row1, 0xEE);
    tmp1 = _mm256_shuffle_pd(row2,row3, 0x44);
    tmp3 = _mm256_shuffle_pd(row2,row3, 0xEE);
    row0 = _mm256_shuffle_pd(tmp0, tmp1, 0x88);
    row1 = _mm256_shuffle_pd(tmp0, tmp1, 0xDD);
    row2 = _mm256_shuffle_pd(tmp2, tmp3, 0x88);
    row3 = _mm256_shuffle_pd(tmp2, tmp3, 0xDD);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row0);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row1);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row2);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row3);
}
Recommended answer
Here is the macro equivalent of the solution I found.
#define _MM_TRANSPOSE4_PD(row0,row1,row2,row3) \
{ \
    __m256d tmp3, tmp2, tmp1, tmp0; \
    \
    tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0); \
    tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF); \
    tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0); \
    tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF); \
    \
    (row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20); \
    (row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20); \
    (row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31); \
    (row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31); \
}
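The immediates from _MM_TRANSPOSE4_PS do not carry over because _mm256_shuffle_pd uses only one selector bit per output element and never moves data across the two 128-bit lanes, so a second step (_mm256_permute2f128_pd) is needed to exchange lanes. To make that easier to follow, here is a minimal scalar sketch of the two intrinsics in plain C++. It is not the intrinsics themselves: the names Vec4d, Mat4d, shuffle_pd_model, permute2f128_pd_model and transpose_model are hypothetical, and the permute model deliberately ignores the zeroing bits of the real instruction, which the macro does not use.

```cpp
#include <array>

// Hypothetical stand-in for one __m256d: elements 0-1 form the low
// 128-bit lane, elements 2-3 the high lane.
using Vec4d = std::array<double, 4>;
using Mat4d = std::array<Vec4d, 4>;

// Scalar model of _mm256_shuffle_pd(a, b, imm): each of the low four bits
// of imm picks the lower (0) or upper (1) double within one lane.
// Outputs alternate between a and b; nothing crosses a lane boundary.
Vec4d shuffle_pd_model(const Vec4d& a, const Vec4d& b, unsigned imm) {
    return { a[0 + ((imm >> 0) & 1)],
             b[0 + ((imm >> 1) & 1)],
             a[2 + ((imm >> 2) & 1)],
             b[2 + ((imm >> 3) & 1)] };
}

// Scalar model of _mm256_permute2f128_pd(a, b, imm), assuming the zeroing
// bits (3 and 7) are clear. Bits [1:0] select the source of the low lane,
// bits [5:4] the source of the high lane:
// 0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high.
Vec4d permute2f128_pd_model(const Vec4d& a, const Vec4d& b, unsigned imm) {
    auto lane = [&](unsigned sel) -> std::array<double, 2> {
        const Vec4d& src = (sel & 2) ? b : a;
        const unsigned base = (sel & 1) ? 2 : 0;
        return { src[base], src[base + 1] };
    };
    const std::array<double, 2> lo = lane(imm & 3);
    const std::array<double, 2> hi = lane((imm >> 4) & 3);
    return { lo[0], lo[1], hi[0], hi[1] };
}

// Same sequence of steps as the _MM_TRANSPOSE4_PD macro, on the model types.
Mat4d transpose_model(const Mat4d& m) {
    // Step 1: interleave within lanes.
    // tmp0 = {m00, m10, m02, m12}, tmp2 = {m01, m11, m03, m13}, and so on.
    const Vec4d tmp0 = shuffle_pd_model(m[0], m[1], 0x0);
    const Vec4d tmp2 = shuffle_pd_model(m[0], m[1], 0xF);
    const Vec4d tmp1 = shuffle_pd_model(m[2], m[3], 0x0);
    const Vec4d tmp3 = shuffle_pd_model(m[2], m[3], 0xF);
    // Step 2: exchange 128-bit lanes between the temporaries.
    return {{ permute2f128_pd_model(tmp0, tmp1, 0x20),
              permute2f128_pd_model(tmp2, tmp3, 0x20),
              permute2f128_pd_model(tmp0, tmp1, 0x31),
              permute2f128_pd_model(tmp2, tmp3, 0x31) }};
}
```

Feeding the model a matrix whose rows are {0,1,2,3}, {4,5,6,7}, {8,9,10,11}, {12,13,14,15} gives rows {0,4,8,12}, {1,5,9,13}, {2,6,10,14}, {3,7,11,15}. Under the same model, the immediate 0x44 from the float version applied to the first two rows yields {0,4,3,6}, since only its low four bits (0100) are used, which is why the attempt in the question does not transpose.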