Question
I want to shift SSE/AVX registers left or right by multiples of 32 bits while shifting in zeros.
Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats:
shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]
For AVX I want to do the following shifts:
shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
shift3_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 0, 0, 1, 2, 3, 4]
For SSE I have come up with the following code:
shift1_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));
shift2_SSE = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
//shift2_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));
Is there a better way to do this with SSE?
For AVX I have come up with the following code, which needs AVX2 (and is untested). Edit: as explained by Paul R, this code won't work.
shift1_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
shift2_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8));
shift3_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 12));
How can I best do this with AVX rather than AVX2 (for example with _mm256_permute or _mm256_shuffle)? Is there a better way to do this with AVX2?
Paul R has informed me that my AVX2 code won't work and that AVX code is probably not worth it. Instead, for AVX2 I should use _mm256_permutevar8x32_ps along with _mm256_and_ps. I don't have a system with AVX2 (Haswell), so this is hard to test.
Based on Felix Wyss's answer I came up with some solutions for AVX which need only three intrinsics for shift1_AVX and shift2_AVX and only one intrinsic for shift3_AVX. This is because _mm256_permute2f128_ps has a zeroing feature.
shift1_AVX
__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);
__m256 y = _mm256_blend_ps(t0, t1, 0x11);
shift2_AVX
__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);
__m256 y = _mm256_blend_ps(t0, t1, 0x33);
shift3_AVX
x = _mm256_permute2f128_ps(x, x, 41);
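For anyone who wants to try the three AVX-only shifts above, here is a minimal self-contained sketch. The function names, the `USE_AVX` and `ZERO_LOW_MOVE_LOW_TO_HIGH` macros, and the pointer-based interface are my own assumptions for illustration; it also assumes GCC or Clang, whose `target` attribute allows AVX code in a file compiled without `-mavx`.

```c
#include <immintrin.h>  /* AVX intrinsics */

/* Assumption: GCC or Clang. The target attribute enables AVX per function. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* 0x29 (decimal 41): bit 3 zeroes the low lane; the high nibble (2)
   selects the low lane of the second operand for the high lane. */
#define ZERO_LOW_MOVE_LOW_TO_HIGH 0x29

/* [1,2,3,4,5,6,7,8] -> [0,1,2,3,4,5,6,7] */
USE_AVX static void shift1_avx(const float *in, float *out)
{
    __m256 x  = _mm256_loadu_ps(in);
    /* Rotate each 128-bit lane by one element: [4,1,2,3 | 8,5,6,7]. */
    __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
    /* [0,0,0,0 | 4,1,2,3] */
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, ZERO_LOW_MOVE_LOW_TO_HIGH);
    /* Elements 0 and 4 come from t1, everything else from t0. */
    _mm256_storeu_ps(out, _mm256_blend_ps(t0, t1, 0x11));
}

/* [1,2,3,4,5,6,7,8] -> [0,0,1,2,3,4,5,6] */
USE_AVX static void shift2_avx(const float *in, float *out)
{
    __m256 x  = _mm256_loadu_ps(in);
    /* Swap the two halves of each lane: [3,4,1,2 | 7,8,5,6]. */
    __m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    /* [0,0,0,0 | 3,4,1,2] */
    __m256 t1 = _mm256_permute2f128_ps(t0, t0, ZERO_LOW_MOVE_LOW_TO_HIGH);
    /* Elements 0, 1, 4 and 5 come from t1, everything else from t0. */
    _mm256_storeu_ps(out, _mm256_blend_ps(t0, t1, 0x33));
}

/* [1,2,3,4,5,6,7,8] -> [0,0,0,0,1,2,3,4] */
USE_AVX static void shift3_avx(const float *in, float *out)
{
    __m256 x = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_permute2f128_ps(x, x, ZERO_LOW_MOVE_LOW_TO_HIGH));
}
```

Each function takes eight floats in memory and writes eight floats back, so it can be called from code that is not itself compiled for AVX, provided the CPU supports it.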
Recommended answer
Your SSE implementation is fine, but I suggest you use the _mm_slli_si128 implementation for both of the shifts. The casts make it look complicated, but it really boils down to just one instruction for each shift.
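As a rough sketch of the byte-shift approach, assuming an x86-64 target (where SSE2 is always available): the helper names and the load/store wrapper are hypothetical, not part of the question. A right shift is included using _mm_srli_si128, the counterpart of _mm_slli_si128.

```c
#include <emmintrin.h>  /* SSE2: _mm_slli_si128, _mm_srli_si128 */

/* Hypothetical helpers. Element 0 is the float at the lowest address. */

/* [a, b, c, d] -> [0, a, b, c]: a 4-byte left shift moves each float up one slot. */
static void shift1_sse(const float *in, float *out)
{
    __m128 x = _mm_loadu_ps(in);
    __m128 y = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));
    _mm_storeu_ps(out, y);
}

/* [a, b, c, d] -> [0, 0, a, b]: an 8-byte left shift moves each float up two slots. */
static void shift2_sse(const float *in, float *out)
{
    __m128 x = _mm_loadu_ps(in);
    __m128 y = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));
    _mm_storeu_ps(out, y);
}

/* [a, b, c, d] -> [b, c, d, 0]: the opposite direction, via a 4-byte right shift. */
static void shift1_right_sse(const float *in, float *out)
{
    __m128 x = _mm_loadu_ps(in);
    __m128 y = _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(x), 4));
    _mm_storeu_ps(out, y);
}
```

The casts generate no instructions; only the byte shift itself does any work on the register.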
Unfortunately, your AVX2 implementation won't work. Almost all AVX instructions are effectively just two SSE instructions operating in parallel on two adjacent 128-bit lanes. So for your first shift_AVX2 example, with an input of [0, 1, 2, 3, 4, 5, 6, 7], you'd get:
0, 0, 1, 2, 0, 4, 5, 6
----------- ----------
  LS lane    MS lane
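To see the per-lane behaviour for yourself, a small sketch like the following should reproduce that result for the input 0 through 7. The function name and the `USE_AVX2` macro are hypothetical; it assumes GCC or Clang and a CPU with AVX2.

```c
#include <immintrin.h>  /* AVX2: _mm256_slli_si256 */

/* Assumption: GCC or Clang. The target attribute enables AVX2 per function. */
#if defined(__GNUC__)
#define USE_AVX2 __attribute__((target("avx2")))
#else
#define USE_AVX2
#endif

/* Applies the 4-byte _mm256_slli_si256 from the question to eight floats.
   The shift happens inside each 128-bit lane, so element 3 is dropped
   instead of being carried over into element 4. */
USE_AVX2 static void lane_shift_left4_avx2(const float *in, float *out)
{
    __m256 x = _mm256_loadu_ps(in);
    __m256 y = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
    _mm256_storeu_ps(out, y);
}
```

With the question's input of 1 through 8, the same function gives [0, 1, 2, 3, 0, 5, 6, 7]: the 4 is lost and a zero appears in the middle of the register.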
All is not lost, however: one of the few instructions which does work across lanes on AVX is _mm256_permutevar8x32_ps. Note that you'll need to use an _mm256_and_ps in conjunction with this to zero the shifted-in elements. Note also that this is an AVX2 solution; AVX on its own is very limited for anything other than basic arithmetic/logic operations, so I think you'll have a hard time doing this efficiently without AVX2.
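A minimal sketch of that permute-plus-mask idea follows. The function name, the signed shift count, and the way the index and mask vectors are built at run time are all my own assumptions for illustration; it assumes GCC or Clang and a CPU with AVX2.

```c
#include <immintrin.h>  /* AVX2: _mm256_permutevar8x32_ps */

/* Assumption: GCC or Clang. The target attribute enables AVX2 per function. */
#if defined(__GNUC__)
#define USE_AVX2 __attribute__((target("avx2")))
#else
#define USE_AVX2
#endif

/* Hypothetical helper: shift eight floats by n whole elements, filling with zeros.
   n > 0 moves data toward higher indices, n < 0 toward lower indices.
   Intended for small n (|n| <= 8). */
USE_AVX2 static void shift_elements_avx2(const float *in, float *out, int n)
{
    int idx[8];   /* source element for each output slot */
    int keep[8];  /* all-ones where the slot holds real data, zero otherwise */
    for (int i = 0; i < 8; ++i) {
        int src   = i - n;
        int valid = (src >= 0 && src < 8);
        idx[i]  = valid ? src : 0;   /* index is a don't-care when masked off */
        keep[i] = valid ? -1 : 0;
    }

    __m256  x     = _mm256_loadu_ps(in);
    __m256i vidx  = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vkeep = _mm256_loadu_si256((const __m256i *)keep);

    /* Cross-lane permute, then AND with the mask to zero the shifted-in slots. */
    __m256 y = _mm256_permutevar8x32_ps(x, vidx);
    y = _mm256_and_ps(y, _mm256_castsi256_ps(vkeep));
    _mm256_storeu_ps(out, y);
}
```

For a fixed shift amount you would probably hoist the index and mask into constants (for example with _mm256_setr_epi32) so that each shift costs just the permute and the AND.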