Problem description
As part of a compression algorithm, I am looking for the optimal way to achieve the following:
I have a simple bitmap in a uint8_t, for example 01010011.
What I want is a __m256i of the form (0, maxint, 0, maxint, 0, 0, maxint, maxint).
One way to achieve this is by shuffling a vector of 8 x maxint into a vector of zeros. But that first requires me to expand my uint8_t to the right shuffle bitmap.
I am wondering if there is a better way?
Recommended answer
Here is a solution (PaulR improved my solution; see the end of my answer or his answer), based on a variation of the question fastest-way-to-broadcast-32-bits-in-32-bytes.
__m256i t1 = _mm256_set1_epi8(x);
__m256i t2 = _mm256_and_si256(t1, mask);
__m256i t4 = _mm256_cmpeq_epi32(t2, _mm256_setzero_si256());
t4 = _mm256_xor_si256(t4, _mm256_set1_epi32(-1));
I don't have AVX2 hardware to test this on right now, but here is an SSE2 version showing that it works. It also shows how to define the mask.
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>
int main(void) {
    /* One 32-bit lane per bit of the input byte, lowest bit first. */
    uint8_t mask[32] = {
        0x01, 0x00, 0x00, 0x00,
        0x02, 0x00, 0x00, 0x00,
        0x04, 0x00, 0x00, 0x00,
        0x08, 0x00, 0x00, 0x00,
        0x10, 0x00, 0x00, 0x00,
        0x20, 0x00, 0x00, 0x00,
        0x40, 0x00, 0x00, 0x00,
        0x80, 0x00, 0x00, 0x00,
    };
    __m128i mask1 = _mm_loadu_si128((__m128i*)&mask[ 0]);
    __m128i mask2 = _mm_loadu_si128((__m128i*)&mask[16]);
    uint8_t x = 0x53; /* 0101 0011 */
    __m128i t1 = _mm_set1_epi8(x);              /* broadcast x to every byte */
    __m128i t2 = _mm_and_si128(t1, mask1);      /* isolate bits 0..3 */
    __m128i t3 = _mm_and_si128(t1, mask2);      /* isolate bits 4..7 */
    __m128i t4 = _mm_cmpeq_epi32(t2, _mm_setzero_si128());
    __m128i t5 = _mm_cmpeq_epi32(t3, _mm_setzero_si128());
    t4 = _mm_xor_si128(t4, _mm_set1_epi32(-1)); /* invert: set bit -> all ones */
    t5 = _mm_xor_si128(t5, _mm_set1_epi32(-1));
    int o1[4], o2[4];
    /* Unaligned stores: o1 and o2 are not guaranteed to be 16-byte aligned. */
    _mm_storeu_si128((__m128i*)o1, t4);
    _mm_storeu_si128((__m128i*)o2, t5);
    for (int i = 0; i < 4; i++) printf("%d \n", o1[i]);
    for (int i = 0; i < 4; i++) printf("%d \n", o2[i]);
    return 0;
}
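For checking the SIMD output against something simpler, a scalar reference can help. The sketch below is not part of the original answer, and the name expand_bitmap_scalar is made up for illustration. It assumes the same lane order as the mask above: lane 0 corresponds to the least significant bit, so for 0x53 the lanes come out in the reverse of the order in which the bits are written in the question.

```c
#include <stdint.h>

/* Hypothetical scalar reference (illustrative name, not from the answer).
   Lane i becomes all ones when bit i of x is set, counting from the least
   significant bit, and 0 otherwise. */
void expand_bitmap_scalar(uint8_t x, uint32_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = ((x >> i) & 1u) ? UINT32_MAX : 0u;
    }
}
```

Printed as signed 32-bit integers, an all-ones lane shows up as -1, which is what the SSE2 program prints for the set bits.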
Edit:
PaulR improved my solution:
__m256i v = _mm256_set1_epi8(u);
v = _mm256_and_si256(v, mask);
v = _mm256_xor_si256(v, mask);
return _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
where the mask is defined as
uint32_t mask[8] = { /* unsigned, since 0x80808080 does not fit in a signed 32-bit int */
0x01010101, 0x02020202, 0x04040404, 0x08080808,
0x10101010, 0x20202020, 0x40404040, 0x80808080,
};
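To see why the and/xor/compare sequence works, it may help to model a single 32-bit lane in scalar code. This is only a sketch with a made-up name (lane_model), not code from either answer. After the AND, the lane holds the mask bit in each byte if that bit is set in u, and 0 otherwise. XOR with the same mask then gives 0 exactly when the bit was set, so the final compare against zero yields all ones without a separate inversion step.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane (illustrative name).
   lane must be in the range 0..7. */
uint32_t lane_model(uint8_t u, int lane) {
    uint32_t mask = 0x01010101u << lane; /* 0x01010101, 0x02020202, ..., 0x80808080 */
    uint32_t v = 0x01010101u * u;        /* u copied into all four bytes, like set1_epi8 */
    v &= mask;                           /* keep only the bit this lane tests */
    v ^= mask;                           /* 0 exactly when that bit was set */
    return (v == 0) ? UINT32_MAX : 0u;   /* like cmpeq_epi32 against zero */
}
```

Compared with the first version, this saves one instruction per vector: the XOR moves before the compare and replaces the final inversion.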
See his answer with performance testing for more details.