Quickly interleave 2 arrays of doubles into an array of structs with 2 float and 1 int (loop-invariant) members, using SIMD double->float conversion?

Problem description

I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast them to float and store them in an array of structs. The reason this is a bottleneck is that it is called either with very large loops, or thousands of times.

Is there a faster way to do this copy & cast operation using SIMD intrinsics? I have seen this answer on faster memcpy, but it doesn't address the cast.

The simple C++ loop case looks like this:

        int _iNum;
        const unsigned int _uiDefaultOffset; // a constant

        double * pInputValues1; // array of double values, count = _iNum;
        double * pInputValues2;

        MyStruct * pOutput;    // array of outputs defined as
        // struct MyStruct
        // {
        //    float O1;
        //    float O2;
        //    unsigned int Offset;
        // };

        for (int i = 0; i < _iNum; ++i)
        {
            pOutput[i].O1 = static_cast<float>(pInputValues1[i]);
            pOutput[i].O2 = static_cast<float>(pInputValues2[i]);
            pOutput[i].Offset = _uiDefaultOffset;
        }

Note: The struct format is [Float, Float, Int] (12 bytes), but we could (if it will help performance) add an extra 4 bytes of padding, making it 16 bytes.
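
If that extra padding is acceptable, each 16-byte struct can be written with a single 16-byte vector store instead of shuffling struct fields across store boundaries. The following SSE2 sketch only illustrates that idea under the padded-layout assumption; it is not part of the original code, and MyStructPadded / cvt_padded_sse are illustrative names.

#include <immintrin.h>
#include <stddef.h>

struct MyStructPadded   // hypothetical padded layout: 16 bytes instead of 12
{
    float O1, O2;
    unsigned int Offset;
    unsigned int Pad;   // padding so each struct fills one 128-bit store
};

void cvt_padded_sse(MyStructPadded *dst, const double *pA, const double *pB,
                    ptrdiff_t len, unsigned int uiDefaultOffset)
{
    const __m128 voff = _mm_castsi128_ps(_mm_set1_epi32((int)uiDefaultOffset));
    ptrdiff_t i = 0;
    for (; i < len - 1; i += 2)
    {
        __m128 a  = _mm_cvtpd_ps(_mm_loadu_pd(&pA[i]));   // [A0 A1 0 0]
        __m128 b  = _mm_cvtpd_ps(_mm_loadu_pd(&pB[i]));   // [B0 B1 0 0]
        __m128 ab = _mm_unpacklo_ps(a, b);                 // [A0 B0 A1 B1]
        // Select [A0 B0 off off] and [A1 B1 off off], one complete struct each.
        __m128 s0 = _mm_shuffle_ps(ab, voff, _MM_SHUFFLE(1, 0, 1, 0));
        __m128 s1 = _mm_shuffle_ps(ab, voff, _MM_SHUFFLE(1, 0, 3, 2));
        _mm_storeu_ps(reinterpret_cast<float*>(dst + i),     s0);
        _mm_storeu_ps(reinterpret_cast<float*>(dst + i + 1), s1);
    }
    for (; i < len; ++i)    // scalar tail for odd len
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
        dst[i].Pad = 0;
    }
}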

Recommended answer

Loosely inspired by Intel's 4x3 transposition example and based on @PeterCordes' solution, here is an AVX1 solution, which should get a throughput of 8 structs within 8 cycles (the bottleneck is still p5):

#include <immintrin.h>
#include <stddef.h>

struct f2u {
  float O1, O2;
  unsigned int Offset;
};
static const unsigned uiDefaultOffset = 123;

void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m256 voffset = _mm256_castsi256_ps(_mm256_set1_epi32(uiDefaultOffset));

    // 8 structs per iteration
    ptrdiff_t i=0;
    for(; i<len-7; i+=8)
    {
        // destination address for next 8 structs as float*:
        float* dst_f = reinterpret_cast<float*>(dst + i);

        // 4*vcvtpd2ps    --->  4*(p1,p5,p23)
        __m128 inA3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i]));
        __m128 inB3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i]));
        __m128 inA7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i+4]));
        __m128 inB7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i+4]));

        // 2*vinsertf128  --->  2*p5
        __m256 A76543210 = _mm256_set_m128(inA7654,inA3210);
        __m256 B76543210 = _mm256_set_m128(inB7654,inB3210);

        // 2*vpermilps    --->  2*p5
        __m256 A56741230 = _mm256_shuffle_ps(A76543210,A76543210,_MM_SHUFFLE(1,2,3,0));
        __m256 B67452301 = _mm256_shuffle_ps(B76543210,B76543210,_MM_SHUFFLE(2,3,0,1));

        // 6*vblendps     ---> 6*p015 (does not need to use p5)
        __m256 outA1__B0A0 = _mm256_blend_ps(A56741230,B67452301,2+16*2);
        __m256 outA1ccB0A0 = _mm256_blend_ps(outA1__B0A0,voffset,4+16*4);

        __m256 outB2A2__B1 = _mm256_blend_ps(B67452301,A56741230,4+16*4);
        __m256 outB2A2ccB1 = _mm256_blend_ps(outB2A2__B1,voffset,2+16*2);

        __m256 outccB3__cc = _mm256_blend_ps(voffset,B67452301,4+16*4);
        __m256 outccB3A3cc = _mm256_blend_ps(outccB3__cc,A56741230,2+16*2);

        // 3* vmovups     ---> 3*(p237,p4)
        _mm_storeu_ps(dst_f+ 0,_mm256_castps256_ps128(outA1ccB0A0));
        _mm_storeu_ps(dst_f+ 4,_mm256_castps256_ps128(outB2A2ccB1));
        _mm_storeu_ps(dst_f+ 8,_mm256_castps256_ps128(outccB3A3cc));
        // 3*vextractf128 ---> 3*(p23,p4)
        _mm_storeu_ps(dst_f+12,_mm256_extractf128_ps(outA1ccB0A0,1));
        _mm_storeu_ps(dst_f+16,_mm256_extractf128_ps(outB2A2ccB1,1));
        _mm_storeu_ps(dst_f+20,_mm256_extractf128_ps(outccB3A3cc,1));
    }

    // scalar cleanup if len is not a multiple of 8
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}

Godbolt link, with minimal test code at the end: https://godbolt.org/z/0kTO2b

For some reason, gcc does not like to generate vcvtpd2ps that converts directly from memory to a register. This works better with aligned loads (having input and output aligned is likely beneficial anyway). And clang apparently wants to outsmart me with one of the vextractf128 instructions at the end.
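
Since aligned input is mentioned above as likely beneficial, here is a minimal usage sketch, assuming the buffers are allocated with _mm_malloc so they are 32-byte aligned (the length and test values are made up for illustration); with that guarantee, the _mm256_loadu_pd calls could also be swapped for _mm256_load_pd.

#include <immintrin.h>
#include <stddef.h>
#include <cstdio>

// f2u and cvt_interleave_avx as defined above

int main()
{
    const ptrdiff_t len = 1003;   // deliberately not a multiple of 8, to exercise the scalar tail

    // _mm_malloc returns 32-byte aligned memory, matching the 256-bit loads.
    double *pA  = static_cast<double*>(_mm_malloc(len * sizeof(double), 32));
    double *pB  = static_cast<double*>(_mm_malloc(len * sizeof(double), 32));
    f2u    *out = static_cast<f2u*>(_mm_malloc(len * sizeof(f2u), 32));

    for (ptrdiff_t i = 0; i < len; ++i)
    {
        pA[i] = i * 0.5;
        pB[i] = i * 0.25;
    }

    cvt_interleave_avx(out, pA, pB, len);
    std::printf("%f %f %u\n", out[len - 1].O1, out[len - 1].O2, out[len - 1].Offset);

    _mm_free(pA);
    _mm_free(pB);
    _mm_free(out);
    return 0;
}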
