What is an efficient way to load an x64 YMM register with 4 separated doubles?


Problem description

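What is the most efficient way to load an x64 YMM register with


  1. 4 evenly spaced doubles, taken from a contiguous array of doubles

      0 1 2 3 4 5 6 7 8 9 10 ... 100
    I want to load, for example, 0, 10, 20, 30


  2. 4 doubles at arbitrary positions

     i.e. I want to load, for example, 1, 6, 22, 43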



Solution

The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later:

  VGATHERQPD ymm1, [RSI + YMM7 * 8], ymm2

It achieves this with one instruction. Here ymm2 is the mask register: the highest bit of each element indicates whether the corresponding value should be copied to ymm1 or left unchanged. ymm7 contains the indices of the elements, which are multiplied by the scale factor (8 here, the size of a double).

So applied to your examples, it could look like this in MASM syntax:

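.data
  align 16
  qqIndices dq 0,10,20,30
  dpValues  REAL8 0.0, 1.0, 2.0, 3.0, ... 100.0
.code
  lea rsi, dpValues
  vmovdqu ymm7, ymmword ptr qqIndices     ; load the four qword indices (unaligned load)
  vpcmpeqw ymm1, ymm1, ymm1               ; set mask to all ones
  vgatherqpd ymm0, [rsi+ymm7*8], ymm1     ; gather 4 doubles; the mask in ymm1 is cleared afterwards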

Now ymm0 contains the four doubles 0, 10, 20, 30, though I haven't tested this yet. Note that this is not necessarily the fastest choice in every scenario: the values are all gathered separately, so each value needs its own memory access; see "How are the gather instructions in AVX2 implemented".
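For readers working in C rather than MASM, the same gather can be sketched with the `_mm256_i64gather_pd` intrinsic. This is only a minimal sketch: the names `gather4` and `gather4_avx2`, the runtime check through `__builtin_cpu_supports`, and the scalar fallback are assumptions for illustration (GCC/Clang on x86-64) and are not part of the original answer.

```c
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* The target attribute enables AVX2 for this function only,
   so the file compiles without -mavx2. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const long long idx[4], double out[4])
{
    /* Four 64-bit element indices; the unaligned load has no alignment requirement. */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 8 = sizeof(double); this intrinsic gathers all four lanes. */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);
    _mm256_storeu_pd(out, v);
}
#endif

/* Hypothetical wrapper: takes the gather path when the CPU reports AVX2,
   otherwise falls back to four scalar loads. */
void gather4(const double *base, const long long idx[4], double out[4])
{
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
        return;
    }
#endif
    for (int i = 0; i < 4; ++i)
        out[i] = base[idx[i]];
}
```

For the evenly spaced case the index array would hold 0, 10, 20, 30; for arbitrary positions it would hold, for example, 1, 6, 22, 43. The same function covers both cases because the gather only sees indices.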

So according to Mysticial's comment

the fastest way would be using that approach.

So "efficient" in terms of instruction count means using VGATHER, while "efficient" in terms of execution time means the last approach (so far; let's see how future architectures perform).

EDIT: according to comments the VGATHER instructions get faster on Broadwell and Skylake.
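For comparison with the gather instruction, a non-gather load can be built from scalar loads that are then combined into one 256-bit register. The comment mentioned above is not reproduced here, so the sketch below is not necessarily the approach it describes; it is one common pattern, and the names `load4_separately` and `load4_avx` as well as the runtime check and fallback are assumptions for illustration (GCC/Clang on x86-64).

```c
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* The target attribute enables AVX for this function only,
   so the file compiles without -mavx. */
__attribute__((target("avx")))
static void load4_avx(const double *base, const long long idx[4], double out[4])
{
    /* Low 128 bits: base[idx[0]] in lane 0, base[idx[1]] in lane 1. */
    __m128d lo = _mm_loadh_pd(_mm_load_sd(base + idx[0]), base + idx[1]);
    /* High 128 bits: base[idx[2]] in lane 0, base[idx[3]] in lane 1. */
    __m128d hi = _mm_loadh_pd(_mm_load_sd(base + idx[2]), base + idx[3]);
    /* Join the two halves into one 256-bit vector. */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}
#endif

/* Hypothetical wrapper: uses the AVX path when the CPU reports AVX,
   otherwise copies the four elements one by one. */
void load4_separately(const double *base, const long long idx[4], double out[4])
{
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        load4_avx(base, idx, out);
        return;
    }
#endif
    for (int i = 0; i < 4; ++i)
        out[i] = base[idx[i]];
}
```

This variant needs only AVX, not AVX2, at the cost of more instructions. Which of the two is faster depends on the microarchitecture, so it is worth measuring on the target CPU.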
