Question
The AVX instruction set introduced VPERMILPS, which seems to be a simplified version of SHUFPS (for the case where both input registers are the same).
For example, the following instruction:
c5 f0 c6 c1 00 vshufps xmm0,xmm1,xmm1,0x0
can be replaced with:
c4 e3 79 04 c1 00 vpermilps xmm0,xmm1,0x0
As you can see, the VPERMILPS version takes one extra byte and does the same thing. According to the instruction tables, both instructions take 1 CPU cycle and have the same throughput.
What's the point of introducing this kind of instruction? Am I missing something?
Recommended answer
Yes, using vpermilps-immediate is normally a missed optimization vs. vshufps (except on Knight's Landing), wasting 1 byte of code size for the same operation with the same performance.
I think the main point of vpermilps is that it's available with a vector control operand. Before AVX, the only variable-control shuffle was integer pshufb.
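To make the vector-control form concrete, here is a minimal scalar sketch in C of what the 128-bit variable-control vpermilps computes (the operation behind the _mm_permutevar_ps intrinsic). The function name permilps_var_model is made up for this illustration; it is a reference model of the element selection, not the hardware implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference model of 128-bit vpermilps with a vector control
 * operand: destination element i is the source element selected by the low
 * 2 bits of control element i. The other control bits are ignored. */
void permilps_var_model(float dst[4], const float src[4], const uint32_t ctrl[4])
{
    assert(dst && src && ctrl);
    float tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[ctrl[i] & 3];      /* only bits 1:0 select the element */
    memcpy(dst, tmp, sizeof tmp);       /* copy last, so dst may alias src */
}
```

Because the control lives in a register, the shuffle pattern can be computed at run time, which an 8-bit immediate cannot express.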
But of course the immediate form has a totally separate opcode, and you're asking why it exists. Intel definitely could have included only the vector version, so the question becomes "why did they include the immediate version?" It takes at least a bit of extra decode hardware. The shuffle unit already has hardware to unpack immediate control operands in this form, because it's identical to vshufps, so perhaps it was cheap-ish to implement?
The only thing you can do with immediate vpermilps that you can't with vshufps is load+shuffle in one instruction, like vpermilps ymm0, [rdi], 0b00011011 to reverse the elements in each lane of the source. But like most instructions with an immediate, it can't micro-fuse a memory operand, so it's still 2 fused-domain uops for the front end. (On AMD CPUs, it actually does save front-end bandwidth.) Still, it saves code size vs. vmovups ymm0, [rdi] / vshufps ymm0,ymm0,ymm0, 0b00011011.
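As a rough check of the in-lane reverse claim, the following C sketch models how the 256-bit immediate form decodes its control byte. The name permilps_imm_model256 is hypothetical, and the model only mirrors the element selection (the same 4x 2-bit fields applied to each 128-bit lane), not the load or any timing behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference model of 256-bit vpermilps with an 8-bit immediate:
 * within each 128-bit lane, destination element i takes the source element
 * selected by immediate bits [2*i+1 : 2*i]. Both lanes reuse the same fields. */
void permilps_imm_model256(float dst[8], const float src[8], uint8_t imm)
{
    assert(dst && src);
    float tmp[8];
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 4; i++)
            tmp[lane * 4 + i] = src[lane * 4 + ((imm >> (2 * i)) & 3)];
    memcpy(dst, tmp, sizeof tmp);       /* copy last, so dst may alias src */
}
```

With imm = 0x1B (0b00011011) the four fields read 3, 2, 1, 0 from low to high, so each lane comes out reversed while the two lanes stay in place.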
Other than that, I don't see much point. The 256-bit forms of vpermilps and vshufps both do the same shuffle in both 128-bit lanes, reusing the 4x 2-bit fields of the immediate for both lanes. (Meanwhile, vpermilpd and vshufpd both use 1-bit fields in their immediates, and can do different shuffles in each lane; the upper lane uses bits 2 and 3. And the ZMM versions use bits 4..7 for the upper 256. So again vpermilpd dst, src, imm is identical to vshufpd dst, src,src, imm, unless you use a memory source or you use a shuffle-control vector instead of an immediate.)
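The double-precision equivalence can be checked exhaustively with a small scalar model. The sketch below is an illustration in C with made-up function names (shufpd256_model, permilpd256_model, count_pd_mismatches); it follows the 1-bit-per-element immediate layout described above for the 256-bit forms and is not taken from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of 256-bit vshufpd: in each 128-bit lane the low result
 * element comes from a and the high one from b, each chosen by one imm bit. */
void shufpd256_model(double dst[4], const double a[4], const double b[4], uint8_t imm)
{
    assert(dst && a && b);
    double tmp[4];
    for (int lane = 0; lane < 2; lane++) {
        int base = 2 * lane;            /* lane 0 uses bits 0,1; lane 1 uses bits 2,3 */
        tmp[base + 0] = a[base + ((imm >> (base + 0)) & 1)];
        tmp[base + 1] = b[base + ((imm >> (base + 1)) & 1)];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Hypothetical model of 256-bit vpermilpd with an immediate: bit i of imm
 * picks the low or high element of the lane that element i belongs to. */
void permilpd256_model(double dst[4], const double src[4], uint8_t imm)
{
    assert(dst && src);
    double tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(i & ~1) + ((imm >> i) & 1)];
    memcpy(dst, tmp, sizeof tmp);
}

/* Count immediates for which vshufpd with both sources equal differs from
 * vpermilpd. Only the low 4 bits matter for the 256-bit forms. */
int count_pd_mismatches(void)
{
    const double src[4] = {0.0, 1.0, 2.0, 3.0};
    int mismatches = 0;
    for (int imm = 0; imm < 16; imm++) {
        double x[4], y[4];
        shufpd256_model(x, src, src, (uint8_t)imm);
        permilpd256_model(y, src, (uint8_t)imm);
        if (memcmp(x, y, sizeof x) != 0)
            mismatches++;
    }
    return mismatches;
}
```

In this model count_pd_mismatches() finds no differing immediate, which matches the claim that the two instructions are interchangeable when both vshufpd sources are the same register.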
It makes you wonder if Intel forgot that VEX encoding was going to enable non-destructive vshufps to do the same thing for immediate shuffles.
Or maybe they were thinking of low-power CPUs like Knight's Landing (Xeon Phi), where 1-input shuffles are cheaper:
vpermilps has 1-cycle throughput there, but vshufps or vperm2f128 has 2-cycle throughput and an extra cycle of latency. (According to Agner Fog's instruction tables.)
So using vshufps with the same input twice is slower there.
But on Intel's big-core mainstream CPUs, yes, using vpermilps-immediate is a missed optimization vs. vshufps, unless you can use it with a memory source. vshufps would need the same memory source twice, which obviously isn't encodable.
AVX was designed years before KNL, but perhaps the ISA designers had in mind that some future CPUs might be more efficient with simpler shuffles.
Regular Silvermont (the out-of-order Atom that KNL is based on) doesn't support AVX, but it has 1 uop / 1-cycle throughput and latency for shufps. Goldmont has 0.5c throughput for shufps.
AFAIK, Intel still hasn't made a low-power core (other than Xeon Phi) with AVX. I don't think they're planning to with Tremont or Gracemont, the successors to Goldmont Plus.