Question
I've been looking at MMX/SSE and I'm wondering about this: there are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords.
Is there a way of doing what I want, or if not, why is there none?
Recommended answer
If you have SSE4.1 available, I don't think you can get better than the pmaxud + psubd approach suggested by @harold. With AVX2, you can of course also use the corresponding 256-bit variants.
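#include <smmintrin.h> // SSE4.1

// max(a,b) >= b in every lane, so the subtraction can never wrap around
__m128i subs_epu32_sse4(__m128i a, __m128i b){
    __m128i mx = _mm_max_epu32(a,b);
    return _mm_sub_epi32(mx, b);
}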
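To see why max-then-subtract saturates, it may help to look at a single lane in plain scalar C. This is only an illustrative sketch: the function names are made up for this example, and it models one 32-bit lane rather than the actual vector instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: unsigned saturating subtraction of one 32-bit lane. */
uint32_t subs_u32_ref(uint32_t a, uint32_t b) {
    return a > b ? a - b : 0u;
}

/* Same result via the identity behind pmaxud + psubd. */
uint32_t subs_u32_max(uint32_t a, uint32_t b) {
    uint32_t mx = a > b ? a : b; /* max(a, b) >= b, so mx - b cannot wrap */
    return mx - b;               /* a - b if a > b, otherwise b - b == 0 */
}

/* Hypothetical self-check over values around the sign bit. */
void check_subs_u32_identity(void) {
    const uint32_t v[] = {0u, 1u, 5u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu};
    const int n = (int)(sizeof v / sizeof v[0]);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            assert(subs_u32_ref(v[i], v[j]) == subs_u32_max(v[i], v[j]));
}
```

Calling check_subs_u32_identity() should pass silently; the vector version does the same thing in four lanes at once.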
Without SSE4.1, you need to compare both arguments in some way. Unfortunately, there is no epu32 comparison (not before AVX512), but you can simulate one by first adding 0x80000000 (which is equivalent to xor-ing in this case) to both arguments:
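#include <emmintrin.h> // SSE2

// unsigned a > b: flipping the top bit maps unsigned order onto signed order
__m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32(0x80000000);
    return _mm_cmpgt_epi32(_mm_xor_si128(a,highest),_mm_xor_si128(b,highest));
}

// keep a-b where a > b, zero the lane otherwise
__m128i subs_epu32(__m128i a, __m128i b){
    __m128i not_saturated = cmpgt_epu32(a,b);
    return _mm_and_si128(not_saturated, _mm_sub_epi32(a,b));
}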
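A quick way to convince yourself that the sign-flip comparison behaves is to run the SSE2 routine on a few lanes chosen around the sign bit. The following is a small sketch under my own assumptions: the _demo names, the array wrapper and the example function are hypothetical helpers for illustration, and INT32_MIN is used to spell the 0x80000000 constant without a narrowing conversion.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Unsigned a > b per 32-bit lane: flip the sign bit, then compare as signed. */
__m128i cmpgt_epu32_demo(__m128i a, __m128i b) {
    const __m128i sign = _mm_set1_epi32(INT32_MIN); /* bit pattern 0x80000000 */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, sign), _mm_xor_si128(b, sign));
}

/* Saturating a - b: keep the difference only in lanes where a > b. */
__m128i subs_epu32_demo(__m128i a, __m128i b) {
    return _mm_and_si128(cmpgt_epu32_demo(a, b), _mm_sub_epi32(a, b));
}

/* Hypothetical convenience wrapper: runs the routine on plain arrays. */
void subs_epu32_array4(const uint32_t a[4], const uint32_t b[4], uint32_t out[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, subs_epu32_demo(va, vb));
}

/* Usage example: lanes 2 and 3 would go wrong with a plain signed compare. */
void subs_epu32_example(void) {
    const uint32_t a[4] = {5u, 3u, 0x80000000u, 0u};
    const uint32_t b[4] = {3u, 5u, 1u, 0xFFFFFFFFu};
    uint32_t r[4];
    subs_epu32_array4(a, b, r);
    assert(r[0] == 2u && r[1] == 0u && r[2] == 0x7FFFFFFFu && r[3] == 0u);
}
```

Lanes 2 and 3 are the interesting ones: as signed values 0x80000000 is the smallest and 0xFFFFFFFF is -1, so without the xor the comparison would give the opposite answer in both.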
In some cases, it might be better to replace the comparison by some bit-twiddling of the highest bit and broadcasting that to every bit using a shift. This replaces a pcmpgtd and three bit-logic operations (and having to load 0x80000000 at least once) by a psrad and five bit-logic operations:
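__m128i subs_epu32_(__m128i a, __m128i b) {
    __m128i r = _mm_sub_epi32(a,b);
    // top bit of c = borrow out of a-b, i.e. it is set in lanes where a < b.
    // Operators on __m128i work with gcc/clang. Replace by corresponding intrinsics, if necessary (note that `andnot` is a single instruction)
    __m128i c = (~a & b) | (r & ~(a^b));
    // broadcast the borrow bit to the whole lane and clear the lanes that underflowed
    return _mm_andnot_si128(_mm_srai_epi32(c,31), r);
}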
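For compilers without operator support on __m128i (MSVC, for example), the same idea can be written with intrinsics only. This is a sketch rather than a tuned implementation, and the function names are mine. One thing to watch: the top bit of the combined expression is the borrow out of a - b, so it is set in the lanes where a < b, and the broadcast mask therefore has to be applied with andnot to zero exactly those lanes.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

__m128i subs_epu32_bits(__m128i a, __m128i b) {
    __m128i r = _mm_sub_epi32(a, b);
    /* Borrow out of bit 31: (~a & b) | (~(a ^ b) & r). Only the top bit matters. */
    __m128i borrow = _mm_or_si128(_mm_andnot_si128(a, b),
                                  _mm_andnot_si128(_mm_xor_si128(a, b), r));
    /* Arithmetic shift copies the top bit everywhere: all-ones where a < b. */
    __m128i saturated = _mm_srai_epi32(borrow, 31);
    /* Zero the lanes that underflowed, keep the difference elsewhere. */
    return _mm_andnot_si128(saturated, r);
}

/* Hypothetical usage example on four lanes. */
void subs_epu32_bits_example(void) {
    const uint32_t a[4] = {5u, 3u, 0x80000000u, 0u};
    const uint32_t b[4] = {3u, 5u, 1u, 0xFFFFFFFFu};
    uint32_t r[4];
    __m128i v = subs_epu32_bits(_mm_loadu_si128((const __m128i *)a),
                                _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)r, v);
    assert(r[0] == 2u && r[1] == 0u && r[2] == 0x7FFFFFFFu && r[3] == 0u);
}
```

The operation count matches the description: one psrad plus five bit-logic operations (three andnot, one xor, one or), with no constant to load.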
Godbolt link, also including adds_epu32 variants: https://godbolt.org/z/n4qaW1
Strangely, clang needs more register copies than gcc for the non-SSE4.1 variants. On the other hand, clang finds the pmaxud optimization for the cmpgt_epu32 variant when compiled with SSE4.1: https://godbolt.org/z/3o5KCm
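The adds_epu32 variants are only behind the link and not reproduced here. For reference, one way a saturating unsigned add could look with plain SSE2 is sketched below. This is my own illustration using the same sign-flip comparison, not necessarily what the linked code does, and the names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

__m128i adds_epu32_demo(__m128i a, __m128i b) {
    const __m128i sign = _mm_set1_epi32(INT32_MIN); /* bit pattern 0x80000000 */
    __m128i r = _mm_add_epi32(a, b);
    /* Unsigned overflow happened iff the wrapped sum is smaller than a. */
    __m128i overflow = _mm_cmpgt_epi32(_mm_xor_si128(a, sign),
                                       _mm_xor_si128(r, sign));
    /* All-ones is UINT32_MAX, so or-ing the mask in saturates those lanes. */
    return _mm_or_si128(r, overflow);
}

/* Hypothetical usage example on four lanes. */
void adds_epu32_example(void) {
    const uint32_t a[4] = {1u, 0xFFFFFFFFu, 0x80000000u, 0xFFFFFFFFu};
    const uint32_t b[4] = {2u, 1u, 0x80000000u, 0u};
    uint32_t r[4];
    __m128i v = adds_epu32_demo(_mm_loadu_si128((const __m128i *)a),
                                _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)r, v);
    assert(r[0] == 3u && r[1] == 0xFFFFFFFFu);
    assert(r[2] == 0xFFFFFFFFu && r[3] == 0xFFFFFFFFu);
}
```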