0 1 2 3 4 5 6 7 8 9 10 ... 100
I want to load, for example, the elements 0, 10, 20, 30.
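AVX2 provides the gather instruction VGATHERQPD ymm1, [RSI + YMM7 * 8], ymm2, which can achieve this with one instruction. Here ymm2 is the mask register: the highest bit of each 64-bit element indicates whether the corresponding value is loaded into ymm1 or left unchanged. ymm7 contains the indices of the elements; each index is multiplied by the scale factor (8, the size of a double) and added to the base address in RSI. Note that with 64-bit indices and a ymm destination the index register must be a ymm register as well, and that the destination, index and mask registers must all be different.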
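To make the roles of the mask, the indices and the scale factor concrete, here is a scalar C model of what the gather does for each of the four elements. It is only a sketch of the semantics, not of how the hardware implements them; the helper name `gather_qpd_model` and its parameters are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of VGATHERQPD with a ymm destination (4 elements).
 * Hypothetical helper, for illustration only. */
static void gather_qpd_model(double dst[4], const void *base,
                             const int64_t index[4], int scale,
                             uint64_t mask[4])
{
    for (int i = 0; i < 4; ++i) {
        if (mask[i] >> 63) {
            /* Highest mask bit set: load one double from base + index * scale. */
            const char *addr = (const char *)base + index[i] * scale;
            memcpy(&dst[i], addr, sizeof(double));
        }
        /* Highest mask bit clear: dst[i] is left unchanged. */
        mask[i] = 0; /* the instruction clears the mask as it completes */
    }
}
```

Because the mask is cleared by the instruction, it has to be set up again before every gather.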
So applied to your examples, it could look like this in MASM syntax:
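align 16
qqIndices dq 0,10,20,30
dpValues REAL8 0.0,1.0,2.0,3.0, ... 100.0
lea rsi, dpValues
vmovdqu ymm7, ymmword ptr qqIndices ; load the four 64-bit indices (unaligned load, since the data is only 16-byte aligned)
vpcmpeqw ymm1, ymm1, ymm1 ; set the mask to all ones
vgatherqpd ymm0, [rsi+ymm7*8], ymm1 ; the mask in ymm1 is cleared afterwards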
Now ymm0 contains the four doubles 0, 10, 20, 30, though I haven't tested this yet. Another thing to mention is that this is not necessarily the fastest choice in every scenario. The values are all gathered separately, which means each value needs its own memory access; see "How are the gather instructions in AVX2 implemented".
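One way to cross-check the expected result without an assembler is the corresponding AVX2 intrinsic, `_mm256_i64gather_pd`, which takes the base pointer, the index vector and the scale. The following is a minimal sketch, assuming GCC or Clang on x86; the helper names `gather4` and `gather4_avx2` are hypothetical, and the scalar fallback is only there so the code also runs on CPUs without AVX2.

```c
#include <immintrin.h>

/* Hypothetical helper: gathers base[idx[0..3]] into out[0..3] with one AVX2 gather.
 * The target attribute (GCC/Clang) enables AVX2 for this function only. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const long long idx[4], double out[4])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx); /* four 64-bit indices */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);          /* scale 8 = sizeof(double) */
    _mm256_storeu_pd(out, v);
}

/* Dispatches at run time: AVX2 gather if available, plain scalar loads otherwise. */
static void gather4(const double *base, const long long idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}
```

The unmasked intrinsic loads all four elements, which corresponds to the all-ones mask in the assembly version.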
So according to Mysticial's comment, the fastest way would be to use that approach.
In short, "efficient" in terms of instruction count means using VGATHER, while "efficient" in terms of execution time means the latter approach (so far; let's see how future architectures perform).
EDIT: According to the comments, the VGATHER instructions got faster on Broadwell and Skylake.