Optimal conversion of a uint8_t bitmap into an 8 x 32-bit SIMD "bool" vector

Question

As part of a compression algorithm, I am looking for the optimal way to achieve the following:

I have a simple bitmap in a uint8_t, for example 01010011.

What I want is a __m256i of the form (0, maxint, 0, maxint, 0, 0, maxint, maxint), where maxint presumably stands for a 32-bit element with all bits set (the answer below produces -1 in those lanes).
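To pin down the target, here is a minimal scalar sketch of the mapping. It assumes that bit i of the input (least significant bit first) controls element i of the output, which is the ordering the answer's code produces, and that a "true" lane is all ones (-1). The function name `expand_bits_scalar` is made up for illustration.

```c
#include <stdint.h>

/* Scalar reference for the desired mapping.
   Assumption: bit 0 (the least significant bit) maps to element 0,
   and a set bit yields an all-ones lane (-1), a clear bit yields 0. */
void expand_bits_scalar(uint8_t x, int32_t out[8]) {
    for (int i = 0; i < 8; i++) {
        /* (x >> i) & 1 is 0 or 1; negating gives 0 or -1 (all bits set). */
        out[i] = -(int32_t)((x >> i) & 1u);
    }
}
```

A loop like this is also handy as a baseline when checking a SIMD version for all 256 possible inputs.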

One way to achieve this is by shuffling a vector of 8 x maxint into a vector of zeros. But that first requires me to expand my uint8_t to the right shuffle bitmap.

I am wondering if there is a better way?

Recommended answer

Here is a solution based on a variation of the question fastest-way-to-broadcast-32-bits-in-32-bytes. (PaulR improved my solution; see the end of this answer or his answer.)

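// x is the uint8_t bitmap; mask holds 1 << i in 32-bit lane i (see the SSE2 version below)
__m256i t1 = _mm256_set1_epi8(x);                             // broadcast x to all 32 bytes
__m256i t2 = _mm256_and_si256(t1, mask);                      // isolate bit i in lane i
__m256i t4 = _mm256_cmpeq_epi32(t2, _mm256_setzero_si256());  // -1 where the bit is clear
t4 = _mm256_xor_si256(t4, _mm256_set1_epi32(-1));             // invert: -1 where the bit is set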

I don't have AVX2 hardware to test this on right now, but here is an SSE2 version that shows it works and also shows how to define the mask.

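#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // One bit per 32-bit lane (little-endian bytes): lane i selects bit i of x.
    // uint8_t rather than char, so that 0x80 fits without a narrowing conversion.
    const uint8_t mask[32] = {
        0x01, 0x00, 0x00, 0x00,
        0x02, 0x00, 0x00, 0x00,
        0x04, 0x00, 0x00, 0x00,
        0x08, 0x00, 0x00, 0x00,
        0x10, 0x00, 0x00, 0x00,
        0x20, 0x00, 0x00, 0x00,
        0x40, 0x00, 0x00, 0x00,
        0x80, 0x00, 0x00, 0x00,
    };
    __m128i mask1 = _mm_loadu_si128((const __m128i*)&mask[ 0]);
    __m128i mask2 = _mm_loadu_si128((const __m128i*)&mask[16]);

    uint8_t x = 0x53; // 0101 0011
    __m128i t1 = _mm_set1_epi8((char)x);                    // broadcast x to every byte
    __m128i t2 = _mm_and_si128(t1, mask1);                  // bits 0..3
    __m128i t3 = _mm_and_si128(t1, mask2);                  // bits 4..7
    __m128i t4 = _mm_cmpeq_epi32(t2, _mm_setzero_si128());  // -1 where the bit is clear
    __m128i t5 = _mm_cmpeq_epi32(t3, _mm_setzero_si128());
    t4 = _mm_xor_si128(t4, _mm_set1_epi32(-1));             // invert: -1 where the bit is set
    t5 = _mm_xor_si128(t5, _mm_set1_epi32(-1));

    int o1[4], o2[4];
    // Unaligned stores: plain int arrays are not guaranteed to be 16-byte aligned.
    _mm_storeu_si128((__m128i*)o1, t4);
    _mm_storeu_si128((__m128i*)o2, t5);
    for(int i=0; i<4; i++) printf("%d \n", o1[i]); // bits 0..3: -1 -1 0 0
    for(int i=0; i<4; i++) printf("%d \n", o2[i]); // bits 4..7: -1 0 -1 0

    return 0;
}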
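The demo above prints the result for a single input. As a rough sketch of how the same mask, compare and invert steps could be checked for every possible byte, the following wraps them in a function and compares against a scalar loop. The names `expand_ref`, `expand_sse2` and `count_mismatches_sse2` are hypothetical, and the `__SSE2__` guard assumes GCC or Clang; where that macro is not defined the code falls back to the scalar loop so that it still builds.

```c
#include <stdint.h>
#include <string.h>

/* Scalar reference: lane i is -1 when bit i of x is set, else 0. */
static void expand_ref(uint8_t x, int32_t out[8]) {
    for (int i = 0; i < 8; i++) out[i] = -(int32_t)((x >> i) & 1u);
}

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Same steps as the demo: broadcast, AND with one bit per lane,
   compare with zero, then invert the comparison result. */
static void expand_sse2(uint8_t x, int32_t out[8]) {
    const __m128i mask_lo = _mm_setr_epi32(0x01, 0x02, 0x04, 0x08);
    const __m128i mask_hi = _mm_setr_epi32(0x10, 0x20, 0x40, 0x80);
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi32(-1);
    const __m128i v = _mm_set1_epi8((char)x);
    __m128i lo = _mm_cmpeq_epi32(_mm_and_si128(v, mask_lo), zero);
    __m128i hi = _mm_cmpeq_epi32(_mm_and_si128(v, mask_hi), zero);
    _mm_storeu_si128((__m128i *)&out[0], _mm_xor_si128(lo, ones));
    _mm_storeu_si128((__m128i *)&out[4], _mm_xor_si128(hi, ones));
}
#else
/* No SSE2 available at compile time: use the scalar loop instead. */
static void expand_sse2(uint8_t x, int32_t out[8]) { expand_ref(x, out); }
#endif

/* Number of byte values for which the two versions disagree. */
int count_mismatches_sse2(void) {
    int mismatches = 0;
    for (int x = 0; x < 256; x++) {
        int32_t simd[8], ref[8];
        expand_sse2((uint8_t)x, simd);
        expand_ref((uint8_t)x, ref);
        if (memcmp(simd, ref, sizeof ref) != 0) mismatches++;
    }
    return mismatches;
}
```

Since there are only 256 inputs, an exhaustive comparison like this is cheap and avoids relying on a single hand-picked value.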

Edit:

PaulR improved my solution

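// u is the uint8_t bitmap; mask is the vector loaded from the array below
__m256i v = _mm256_set1_epi8(u);                       // broadcast u to all 32 bytes
v = _mm256_and_si256(v, mask);                         // keep bit i in every byte of lane i
v = _mm256_xor_si256(v, mask);                         // lane is zero exactly when bit i was set
return _mm256_cmpeq_epi32(v, _mm256_setzero_si256());  // zero lanes become -1, others 0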

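with the mask defined as

// uint32_t avoids a narrowing conversion for 0x80808080; load it with _mm256_loadu_si256
const uint32_t mask[8] = {
    0x01010101, 0x02020202, 0x04040404, 0x08080808,
    0x10101010, 0x20202020, 0x40404040, 0x80808080,
};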
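Putting the fragment and the mask together, a complete function might look like the sketch below. Everything beyond the four intrinsic calls and the mask values is an assumption made for illustration: the function names, the GCC/Clang `target("avx2")` attribute (so the file builds without `-mavx2`), the runtime check with `__builtin_cpu_supports`, and the scalar fallback for machines or compilers without AVX2.

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* Built for AVX2 regardless of the global -m flags (GCC/Clang attribute). */
__attribute__((target("avx2")))
static void expand_bits_avx2(uint8_t u, int32_t out[8]) {
    static const uint32_t mask_words[8] = {
        0x01010101u, 0x02020202u, 0x04040404u, 0x08080808u,
        0x10101010u, 0x20202020u, 0x40404040u, 0x80808080u,
    };
    const __m256i mask = _mm256_loadu_si256((const __m256i *)mask_words);
    __m256i v = _mm256_set1_epi8((char)u);  /* broadcast u to all 32 bytes */
    v = _mm256_and_si256(v, mask);          /* keep bit i in lane i */
    v = _mm256_xor_si256(v, mask);          /* lane is 0 iff bit i was set */
    v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256()); /* 0 -> -1, else 0 */
    _mm256_storeu_si256((__m256i *)out, v);
}
#endif

/* Scalar fallback with the same lane order (bit 0 -> element 0). */
static void expand_bits_portable(uint8_t u, int32_t out[8]) {
    for (int i = 0; i < 8; i++) out[i] = -(int32_t)((u >> i) & 1u);
}

/* Hypothetical entry point: uses AVX2 when the CPU reports it. */
void expand_bits(uint8_t u, int32_t out[8]) {
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        expand_bits_avx2(u, out);
        return;
    }
#endif
    expand_bits_portable(u, out);
}
```

Compared with the first version, the XOR with the mask happens before the comparison, so the compare already yields the final result and no separate inversion is needed. In a real hot loop the runtime check would normally be done once outside the loop rather than on every call.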

See his answer with performance testing for more details.

05-30 19:14