Is there a TRANSPOSE4 equivalent for __m256d?

Question

Intel provides _MM_TRANSPOSE4_PS to transpose a 4x4 matrix of floats held in four __m128 row vectors. I want to do the equivalent with __m256d. However, I can't figure out how to use _mm256_shuffle_pd in the same manner.

_MM_TRANSPOSE4_PS code

#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {                 \
            __m128 tmp3, tmp2, tmp1, tmp0;                          \
                                                                    \
            tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);          \
            tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);          \
            tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);          \
            tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);          \
                                                                    \
            (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);              \
            (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);              \
            (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);              \
            (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);              \
        }
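
For context, the macro rewrites its four row arguments in place, so the usual pattern is load, transpose, store. The following is only a minimal sketch: the helper name `transpose4x4_ps` and the row-major `float[16]` layout are assumptions for illustration, not part of the original question.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS

// Hypothetical helper: transposes a row-major 4x4 float matrix in place.
// Unaligned loads/stores are used so the pointer needs no special alignment.
void transpose4x4_ps(float* m) {
    __m128 row0 = _mm_loadu_ps(m + 0);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);

    // The macro overwrites row0..row3 with the columns of the input.
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);

    _mm_storeu_ps(m + 0, row0);
    _mm_storeu_ps(m + 4, row1);
    _mm_storeu_ps(m + 8, row2);
    _mm_storeu_ps(m + 12, row3);
}
```

After the call, element (r, c) of the array holds what was previously at (c, r).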

My attempt at a _MM_TRANSPOSE4_PD, inside the loop where I need it

for (int copy = i; copy < m2.size();)
{
    __m256d row0 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d row1 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d row2 = _mm256_load_pd(m2data + copy);
    copy += m2.col();
    __m256d row3 = _mm256_load_pd(m2data + copy);
    copy += m2.col();

    __m256d tmp3, tmp2, tmp1, tmp0;

    tmp0 = _mm256_shuffle_pd(row0,row1, 0x44);
    tmp2 = _mm256_shuffle_pd(row0,row1, 0xEE);
    tmp1 = _mm256_shuffle_pd(row2,row3, 0x44);
    tmp3 = _mm256_shuffle_pd(row2,row3, 0xEE);

    row0 = _mm256_shuffle_pd(tmp0, tmp1, 0x88);
    row1 = _mm256_shuffle_pd(tmp0, tmp1, 0xDD);
    row2 = _mm256_shuffle_pd(tmp2, tmp3, 0x88);
    row3 = _mm256_shuffle_pd(tmp2, tmp3, 0xDD);

    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row0);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row1);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row2);
    _mm256_store_pd(reinterpret_cast<double*>(buffer + counter++),row3);
}

Recommended answer

Here is the macro equivalent of the solution I found.

#define _MM_TRANSPOSE4_PD(row0, row1, row2, row3)                       \
        {                                                               \
            __m256d tmp3, tmp2, tmp1, tmp0;                             \
                                                                        \
            /* Pair up elements within each 128-bit lane:            */ \
            /* 0x0 takes the even elements, 0xF the odd elements.    */ \
            tmp0 = _mm256_shuffle_pd((row0), (row1), 0x0);              \
            tmp2 = _mm256_shuffle_pd((row0), (row1), 0xF);              \
            tmp1 = _mm256_shuffle_pd((row2), (row3), 0x0);              \
            tmp3 = _mm256_shuffle_pd((row2), (row3), 0xF);              \
                                                                        \
            /* Combine lanes across registers:                       */ \
            /* 0x20 joins the low lanes, 0x31 joins the high lanes.  */ \
            (row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20);          \
            (row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20);          \
            (row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31);          \
            (row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31);          \
        }
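
To see why these immediates work while the ones copied from _MM_TRANSPOSE4_PS do not, it can help to trace the data movement with a scalar model. The sketch below is an illustration only: `Vec4`, `shuffle_pd_model`, `permute2f128_pd_model` and `transpose4_model` are hypothetical names that imitate the documented behaviour of the two intrinsics in portable C++, and none of them are Intel APIs. The key point is that _mm256_shuffle_pd consults only the low four bits of its immediate and never moves data between the two 128-bit lanes, so a second step (_mm256_permute2f128_pd) is needed to move whole lanes across registers.

```cpp
#include <array>

// Stand-in for __m256d: elements 0-1 are the low 128-bit lane, 2-3 the high lane.
using Vec4 = std::array<double, 4>;

// Scalar model of _mm256_shuffle_pd(a, b, imm). Only bits 0..3 of imm matter:
// bit 0 picks a[0]/a[1], bit 1 picks b[0]/b[1] (low lane),
// bit 2 picks a[2]/a[3], bit 3 picks b[2]/b[3] (high lane).
Vec4 shuffle_pd_model(const Vec4& a, const Vec4& b, int imm) {
    return {a[(imm >> 0) & 1],
            b[(imm >> 1) & 1],
            a[2 + ((imm >> 2) & 1)],
            b[2 + ((imm >> 3) & 1)]};
}

// Fills one 128-bit lane of the result from a 4-bit control field.
static void select_lane(const Vec4& a, const Vec4& b, int ctl, double* out) {
    if (ctl & 0x8) {                          // bit 3 zeroes the lane
        out[0] = out[1] = 0.0;
        return;
    }
    const Vec4& src = (ctl & 0x2) ? b : a;    // bit 1: take from a or b
    const int base = (ctl & 0x1) ? 2 : 0;     // bit 0: low or high lane of src
    out[0] = src[base];
    out[1] = src[base + 1];
}

// Scalar model of _mm256_permute2f128_pd(a, b, imm):
// imm[3:0] controls the low result lane, imm[7:4] the high result lane.
Vec4 permute2f128_pd_model(const Vec4& a, const Vec4& b, int imm) {
    Vec4 r{};
    select_lane(a, b, imm & 0xF, &r[0]);
    select_lane(a, b, (imm >> 4) & 0xF, &r[2]);
    return r;
}

// Same sequence of steps as the _MM_TRANSPOSE4_PD macro, on the model types.
void transpose4_model(Vec4& row0, Vec4& row1, Vec4& row2, Vec4& row3) {
    const Vec4 tmp0 = shuffle_pd_model(row0, row1, 0x0);  // r0[0] r1[0] r0[2] r1[2]
    const Vec4 tmp2 = shuffle_pd_model(row0, row1, 0xF);  // r0[1] r1[1] r0[3] r1[3]
    const Vec4 tmp1 = shuffle_pd_model(row2, row3, 0x0);  // r2[0] r3[0] r2[2] r3[2]
    const Vec4 tmp3 = shuffle_pd_model(row2, row3, 0xF);  // r2[1] r3[1] r2[3] r3[3]

    row0 = permute2f128_pd_model(tmp0, tmp1, 0x20);       // column 0
    row1 = permute2f128_pd_model(tmp2, tmp3, 0x20);       // column 1
    row2 = permute2f128_pd_model(tmp0, tmp1, 0x31);       // column 2
    row3 = permute2f128_pd_model(tmp2, tmp3, 0x31);       // column 3
}
```

In this model, the question's first shuffle with 0x44 behaves like 0x4, because the upper four bits are ignored. It therefore yields r0[0], r1[0], r0[3], r1[2] rather than anything resembling the float shuffle, which is why the _MM_TRANSPOSE4_PS immediates cannot be reused directly.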


08-29 15:03