AVX2: How to efficiently load four integers into the even indices of a 256-bit register and copy them to the odd indices?

I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256-bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just add a register containing 0, 1, 0, 1, 0, 1, 0, 1.

I found the intrinsic _mm256_castsi128_si256, which lets me load the 4 integers into the lower 128 bits of the 256-bit register, but I'm struggling to find the best intrinsics to use from there.

Any help would be appreciated. I have access to all SSE versions, AVX, and AVX2 and would like to do this using intrinsics only.

Edit: I think this works, but I'm not sure how efficient it is... I'm in the process of testing it.

    // _mm_load_si128: Loads 4 integer values into a temporary 128-bit register.
    // _mm256_broadcastsi128_si256: Copies the 4 integer values in the 128-bit register to the low and high 128 bits of the 256-bit register.
    __m256i tmpStuff = _mm256_broadcastsi128_si256(_mm_load_si128((__m128i*) indicesArray));
    // _mm256_unpacklo_epi32: Interleaves the integer values of source0 and source1.
    __m256i indices = _mm256_unpacklo_epi32(tmpStuff, tmpStuff);
    __m256i regToAdd = _mm256_set_epi32(0, 1, 0, 1, 0, 1, 0, 1);
    indices = _mm256_add_epi32(indices, regToAdd);

Edit2: The above code does not work because _mm256_unpacklo_epi32 does not behave the way I thought. The code above results in I0, I0+1, I1, I1+1, I0, I0+1, I1, I1+1.

Edit3: The following code works, though again I'm not sure if it's the most efficient:

    __m256i tmpStuff = _mm256_castsi128_si256(_mm_loadu_si128((__m128i*) indicesArray));
    __m256i mask = _mm256_set_epi32(3, 3, 2, 2, 1, 1, 0, 0);
    __m256i indices = _mm256_permutevar8x32_epi32(tmpStuff, mask);
    __m256i regToAdd = _mm256_set_epi32(1, 0, 1, 0, 1, 0, 1, 0); // Set in reverse order.
    indices = _mm256_add_epi32(indices, regToAdd);

Solution

Your _mm256_permutevar8x32_epi32 version looks ideal for Intel CPUs, unless I'm missing a way that could fold the shuffle into a 128b load. That could help slightly for fused-domain uop throughput, but not for unfused-domain.

1 load (vmovdqa), 1 shuffle (vpermd, aka _mm256_permutevar8x32_epi32) and 1 add (vpaddd) is pretty light-weight. On Intel, lane-crossing shuffles have extra latency but no worse throughput. On AMD Ryzen, lane-crossing shuffles are more expensive (http://agner.org/optimize/).

Since you can use AVX2, your solution is great if loading a shuffle mask for vpermd isn't a problem (register pressure / cache misses).

Beware that _mm256_castsi128_si256 doesn't guarantee the high half of the __m256i is all zero. But you don't depend on this, so your code is totally fine.

BTW, you could use one 256-bit load and unpack it 2 different ways with vpermd: use another mask with all elements 4 higher. A sketch of that idea is below.
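In intrinsics that might look something like the following (just a sketch, untested and unbenchmarked; expand8 is a made-up name, and I'm assuming indicesArray holds at least 8 indices and is 32-byte aligned):

    #include <immintrin.h>
    #include <stdint.h>

    // Sketch: expand 8 indices from one aligned 256-bit load into two output
    // vectors of the form {I_n, I_n+1, ...}.  (expand8 is a hypothetical name.)
    static inline void expand8(const int32_t *indicesArray,
                               __m256i *out_lo, __m256i *out_hi)
    {
        __m256i v    = _mm256_load_si256((const __m256i*) indicesArray);  // I0..I7 (assumes 32-byte alignment)
        __m256i m_lo = _mm256_setr_epi32(0, 0, 1, 1, 2, 2, 3, 3);         // vpermd control: duplicate I0..I3
        __m256i m_hi = _mm256_setr_epi32(4, 4, 5, 5, 6, 6, 7, 7);         // same, 4 higher: duplicate I4..I7
        __m256i ones = _mm256_setr_epi32(0, 1, 0, 1, 0, 1, 0, 1);         // +1 in the odd elements

        *out_lo = _mm256_add_epi32(_mm256_permutevar8x32_epi32(v, m_lo), ones); // I0, I0+1, ..., I3, I3+1
        *out_hi = _mm256_add_epi32(_mm256_permutevar8x32_epi32(v, m_hi), ones); // I4, I4+1, ..., I7, I7+1
    }

That amortizes the load: one vmovdqa feeds two vpermd + vpaddd pairs, so per result vector you still pay 1 shuffle + 1 add, but only half a load.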
Another option is an unaligned 256b load with the lane split in the middle of your 4 elements, so you have 2 elements at the bottom of the high lane and 2 at the top of the low lane. Then you can use an in-lane shuffle to put your data where it's needed. But it's a different shuffle in each lane, so you still need a shuffle that takes the control operand in a register (not an immediate) to do it in a single operation. (vpshufd and vpermilps imm8 recycle the same immediate for both lanes.) The only shuffles where different bits of the immediate affect the upper / lower lane separately are qword-granularity shuffles like vpermq (_mm256_permutex_epi64, not permutexvar).

You could use vpermilps ymm,ymm,ymm, or vpshufb (_mm256_shuffle_epi8) for this, which will run more efficiently on Ryzen than a lane-crossing vpermd (probably 3 uops / 1 per 4c throughput if it's the same as vpermps, according to Agner Fog).

But using an unaligned load is not appealing when your data is already aligned, and all it gains is an in-lane vs. lane-crossing shuffle. If you'd needed a 16 or 8-bit granularity shuffle, it would probably be worth it (because there is no lane-crossing byte or word shuffle until AVX512, and on Skylake-AVX512 vpermw is multiple uops).

An alternative that avoids a shuffle-mask vector constant, but performs worse (because it takes twice as many shuffles): vpmovzxdq is another option for getting the upper two elements into the upper 128-bit lane.

    ; slow, not recommended.  Avoids using a register for shuffle-control, though.
    vpmovzxdq  ymm0, [src]
    vpshufd    ymm1, ymm0, _MM_SHUFFLE(2,2, 0,0)   ; duplicate elements
    vpaddd     ...

Or, possibly higher throughput than the 2-shuffle version above if the shuffle port is a bottleneck for the whole loop. (Still worse than the vpermd version, though.)

    ; slow, not recommended.
    vpmovzxdq  ymm0, [src]
    vpsllq     ymm1, ymm0, 32            ; left shift by 32
    vpor       ymm0, ymm0, odd_ones      ; OR with set1_epi64x(1ULL << 32)
    vpaddd     ymm0, ymm0, ymm1          ; I_n+0 in even elements, 1+I_n in odd

This has some instruction-level parallelism: the OR can run in parallel with the shift. But it still sucks for being more uops; if you're out of vector regs, it's probably still best to use a shuffle-control vector from memory.
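Since you want intrinsics rather than asm, that last version would look roughly like this (a sketch only, untested; the function name is made up):

    #include <immintrin.h>
    #include <stdint.h>

    // Sketch of the shift/OR alternative above in intrinsics (hypothetical name).
    static inline __m256i expand4_noshufmask(const int32_t *indicesArray)
    {
        // vpmovzxdq: zero-extend I0..I3 to qwords, i.e. dwords I0,0, I1,0, I2,0, I3,0
        __m256i v    = _mm256_cvtepu32_epi64(_mm_load_si128((const __m128i*) indicesArray));
        __m256i shft = _mm256_slli_epi64(v, 32);                           // vpsllq:  0,I0, 0,I1, 0,I2, 0,I3
        __m256i vor  = _mm256_or_si256(v, _mm256_set1_epi64x(1ULL << 32)); // vpor:   I0,1, I1,1, I2,1, I3,1
        return _mm256_add_epi32(vor, shft);                                // vpaddd: I0, I0+1, I1, I1+1, ...
    }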