问题描述
我应该算一个__m128i寄存器的设置的位数。
特别是,我应该写两个函数都能够计数寄存器的比特数,用以下方式
I should count the number of set bits of a __m128i register.In particular, I should write two functions that are able to count the number of bits of the register, using the following ways.
- 寄存器的组的比特的总数。
- 设置位为寄存器的每个字节数。
有没有可以执行全部或部分内部函数,上述操作?
Are there intrinsic functions that can perform, wholly or partially, the above operations?
推荐答案
下面是一些codeS我的老项目(有关于它)的研究论文。功能 popcnt8
以下计算中的每个字节设置的位数。
Here are some codes I used in an old project (there is a research paper about it). The function popcnt8
below computes the number of bits set in each byte.
只有SSE2版本(在基于算法3):
SSE2-only version (based on Algorithm 3 in Hacker's Delight book):
static const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
static inline __m128i popcnt8(__m128i x) {
__m128i n;
// Count bits in each 4-bit field.
n = _mm_srli_epi64(x, 1);
n = _mm_and_si128(popcount_mask1, n);
x = _mm_sub_epi8(x, n);
n = _mm_srli_epi64(n, 1);
n = _mm_and_si128(popcount_mask1, n);
x = _mm_sub_epi8(x, n);
n = _mm_srli_epi64(n, 1);
n = _mm_and_si128(popcount_mask1, n);
x = _mm_sub_epi8(x, n);
x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
x = _mm_and_si128(popcount_mask2, x);
return x;
}
SSSE3版本(由于):
static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
static inline __m128i popcnt8(__m128i n) {
const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(n, popcount_mask));
const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(n, 4), popcount_mask));
return _mm_add_epi8(pcnt0, pcnt1);
}
XOP版本(相当于SSSE3,但使用XOP指令,速度更快的AMD推土机)
XOP version (equivalent to SSSE3, but uses XOP instructions which are faster on AMD Bulldozer)
static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
static const __m128i popcount_shift = _mm_set1_epi8(-4);
static inline __m128i popcount8(__m128i n) {
const __m128i pcnt0 = _mm_perm_epi8(popcount_table, popcount_table, _mm_and_si128(n, popcount_mask));
const __m128i pcnt1 = _mm_perm_epi8(popcount_table, popcount_table, _mm_shl_epi8(n, popcount_shift));
return _mm_add_epi8(pcnt0, pcnt1);
}
功能 popcnt64
位中的低和高64位的零部件数量SSE寄存器下方计数
SSE2版本:
SSE2 version:
static inline __m128i popcnt64(__m128i n) {
const __m128i cnt8 = popcnt8(n);
return _mm_sad_epu8(cnt8, _mm_setzero_si128());
}
XOP版本:
static inline __m128i popcnt64(__m128i n) {
const __m128i cnt8 = popcnt8(n);
return _mm_haddq_epi8(cnt8);
}
最后,函数 popcnt128
计数低于整个128位寄存器的位数:
Finally, the function popcnt128
below count the number of bits in the whole 128-bit register:
static inline int popcnt128(__m128i n) {
const __m128i cnt64 = popcnt64(n);
const __m128i cnt64_hi = _mm_unpackhi_epi64(cnt64, cnt64);
const __m128i cnt128 = _mm_add_epi32(cnt64, cnt64_hi);
return _mm_cvtsi128_si32(cnt128);
}
不过,更有效的方式来实施 popcnt128
是使用硬件POPCNT指令(上支持它的处理器):
However, a more efficient way to implement popcnt128
is to use hardware POPCNT instruction (on processors which support it):
static inline int popcnt128(__m128i n) {
const __m128i n_hi = _mm_unpackhi_epi64(n, n);
#ifdef _MSC_VER
return __popcnt64(_mm_cvtsi128_si64(n)) + __popcnt64(_mm_cvtsi128_si64(n_hi));
#else
return __popcntq(_mm_cvtsi128_si64(n)) + __popcntq(_mm_cvtsi128_si64(n_hi));
#endif
}
这篇关于快速计数在__m128i寄存器组的比特数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!