Problem description
I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling.
I profiled it with IACA to see if it was worth writing an AVX2 routine for it, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB+HSW measured with IACA 2.1, SKL with 3.0, but HSW gives the same number with 3.0.)
From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5.
I googled a bit but couldn't find an explanation. Is Haswell really slower with legacy SSE, or did I just hit some extreme corner case? Any suggestions to dodge this bullet (other than AVX2, which is a known option but postponed for now because it would require updating the toolchain to a new version)?
General remarks or suggested improvements are also welcome.
// r8 and r9 are #bytes to go to the next line in resp. src and dest
// r12=3*r8 r13=3*r9
// load 8x8 bytes into 4 registers, bytes interleaved.
movq xmm1,[rcx]
movq xmm4,[rcx+2*r8]
PUNPCKLBW xmm1,xmm4 // 0 2 0 2 0 2
movq xmm7,[rcx+r8]
movq xmm6,[rcx+r12]
PUNPCKLBW xmm7,xmm6 // 1 3 1 3 1 3
movdqa xmm2,xmm1
punpcklbw xmm1,xmm7 // 0 1 2 3 0 1 2 3 in xmm1:xmm2
punpckhbw xmm2,xmm7
lea rcx,[rcx+4*r8]
// same for 4..7
movq xmm3,[rcx]
movq xmm5,[rcx+2*r8]
PUNPCKLBW xmm3,xmm5
movq xmm7,[rcx+r8]
movq xmm8,[rcx+r12]
PUNPCKLBW xmm7,xmm8
movdqa xmm4,xmm3
punpcklbw xmm3,xmm7
punpckhbw xmm4,xmm7
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
movdqa xmm5,xmm1
pextrd eax,xmm3,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm3
pextrd eax,xmm1,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm1
pextrd eax,xmm3,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm3
pextrd eax,xmm1,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
movdqa xmm5,xmm2
pextrd eax,xmm4,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm4
pextrd eax,xmm2,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm2
pextrd eax,xmm4,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm4
pextrd eax,xmm2,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
Purpose: it is really rotating images from a camera for image-vision purposes. In some (heavier) apps the rotation is postponed and done display-only (OpenGL); in some it is easier to rotate the input than to adapt the algorithms.
Updated code: I posted some final code here. The speedup was very dependent on the size of the input: large on small images, but still a factor of two on larger ones compared to loop-tiling HLL code with a 32x32 tile. (Same algorithm as the linked asm code.)
Answer
TL:DR: use punpckl/hdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?
Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.
Your code bottlenecks heavily on shuffle throughput.
Haswell only has one shuffle execution unit, on port 5. SnB/IvB have 2 integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide.
I see you already found David Kanter's excellent Haswell microarch write-up.
It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.
You have some simple missed-optimizations, even before we think about major changes:
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
should be movhps [rdx+r13],xmm6. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.
pextrd eax,xmm3,0 is always worse than movd eax, xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win. You could do a 64-bit movq and then overwrite half of that with a 32-bit pextrd. You might then bottleneck on store throughput. Also, on Sandybridge, indexed addressing modes don't stay micro-fused, so more stores would hurt your total uop throughput. But Haswell doesn't have that problem for stores, only for some indexed loads depending on the instruction.) If you use more stores in some places and more shuffles in others, you could use the extra stores for the single-register addressing modes.
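A minimal sketch of that movq-then-overwrite idea, for the first output row only (it assumes the dword naming used further down, i.e. xmm1 = [ a b c d ], xmm3 = [ i j k l ], and that [rdx] should end up holding [ a i ]):

movq   [rdx], xmm1          ; store dwords 0-1 of xmm1:  [rdx] = a b
pextrd [rdx+4], xmm3, 0     ; overwrite the upper dword: [rdx] = a i
; (movd [rdx+4], xmm3 is equivalent here since the index is 0; pextrd is needed
;  for later rows where the dword index is 1, 2 or 3)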
Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant.
This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
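A minimal sketch of that staging-buffer idea, assuming a hypothetical 16-byte-aligned 64-byte buffer addressed by rsi (the buffer and register choice are illustrative, not part of the original code):

; copy phase: eight movq loads/stores; the r8 stride is only needed here
movq xmm0, [rcx]
movq [rsi], xmm0
movq xmm0, [rcx+r8]
movq [rsi+8], xmm0
; ... the remaining six rows the same way ...
; work phase: the stride is now an assemble-time constant of 8 bytes,
; so one aligned 16-byte load fetches two rows at once
movdqa xmm1, [rsi]          ; rows 0 and 1
movdqa xmm2, [rsi+16]       ; rows 2 and 3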
Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
Extract/insert to/from integer is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.
Also consider using SSSE3 pshufb, which can put your data anywhere it needs to be, and zero other elements. This can set you up for merging with por. (You might use pshufb instead of punpcklbw.)
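A minimal sketch of the pshufb + por merging idea (register choices are illustrative, and mask_lo / mask_hi are hypothetical 16-byte constants; any mask byte with bit 7 set produces a zero in that position):

movdqa xmm5, xmm1
pshufb xmm5, [mask_lo]      ; scatter xmm1's bytes to their final positions, zero the rest
movdqa xmm6, xmm3
pshufb xmm6, [mask_hi]      ; scatter xmm3's bytes into the complementary positions
por    xmm5, xmm6           ; merge: every byte now comes from exactly one source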
Another option is to use shufps to combine data from two sources; you might need another shuffle after that. Or use punpckldq.
// "low" dwords from XMM1:xmm2
// high dwords from XMM3:xmm4
; xmm1: [ a b c d ] xmm2: [ e f g h ]
; xmm3: [ i j k l ] xmm4: [ m n o p ]
; want: [ a i b j ] / [ c k d l ] / ... I think.
;; original: replace these with
; movdqa xmm5,xmm1 ; xmm5 = [ a b c d ]
; pextrd eax,xmm3,0 ; eax = i
; pinsrd xmm5,eax,1 ; xmm5 = [ a i ... ]
; movq [rdx],xmm5
; movdqa xmm5,xmm3 ; xmm5 = [ i j k l ]
; pextrd eax,xmm1,1 ; eax = b
; pinsrd xmm5,eax,0 ; xmm5 = [ b j ... ]
; movq [rdx+r9],xmm5
Replace with:
movdqa xmm5, xmm1
punpckldq xmm5, xmm3 ; xmm5 = [ a i b j ]
movq [rdx], xmm5
movhps [rdx+r9], xmm5 ; still a pure store, doesn't cost a shuffle
So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).
Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well.
; movdqa xmm5,xmm1 ; xmm5 = [ a b c d ]
; pextrd eax,xmm3,2 ; eax = k
; pinsrd xmm5,eax,3 ; xmm5 = [ a b c k ]
; MOVHLPS xmm6,xmm5 ; xmm6 = [ c k ? ? ] (false dependency on old xmm6)
; movq [rdx+2*r9],xmm6
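A sketch of the punpckhdq replacement for that block (and for the [rdx+r13] store that follows the same pattern), assuming xmm1 and xmm3 still hold [ a b c d ] and [ i j k l ] and xmm1 is no longer needed afterwards; otherwise copy to a scratch register first:

punpckhdq xmm1, xmm3        ; xmm1 = [ c k d l ]
movq   [rdx+2*r9], xmm1     ; store [ c k ]
movhps [rdx+r13], xmm1      ; store [ d l ]; again a pure store, no shuffle uop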
Then unpack lo/hi for xmm2 and xmm4.
Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
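For example, a minimal AVX-128 sketch of the same dword interleave (VEX-encoded, three-operand, so xmm1 is left intact):

vpunpckldq xmm5, xmm1, xmm3 ; xmm5 = [ a i b j ]
vmovq      [rdx], xmm5
vmovhps    [rdx+r9], xmm5
vpunpckhdq xmm5, xmm1, xmm3 ; xmm5 = [ c k d l ]
vmovq      [rdx+2*r9], xmm5
vmovhps    [rdx+r13], xmm5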