Problem description
I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling.
I profiled it with IACA to see if it was worth writing an AVX2 routine for it, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB+HSW measured with IACA 2.1, SKL with 3.0, but HSW gives the same number with 3.0.)
From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5.
I googled a bit but couldn't find an explanation. Is Haswell really slower with legacy SSE, or did I just hit some extreme corner case? Any suggestions to dodge this bullet (other than AVX2, which is a known option but postponed for now because it would require updating the toolchain to a new version)?
General remarks or suggested improvements are also welcome.
// r8 and r9 are #bytes to go to the next line in resp. src and dest
// r12=3*r8 r13=3*r9
// load 8x8 bytes into 4 registers, bytes interleaved.
movq xmm1,[rcx]
movq xmm4,[rcx+2*r8]
PUNPCKLBW xmm1,xmm4 // 0 2 0 2 0 2
movq xmm7,[rcx+r8]
movq xmm6,[rcx+r12]
PUNPCKLBW xmm7,xmm6 // 1 3 1 3 1 3
movdqa xmm2,xmm1
punpcklbw xmm1,xmm7 // 0 1 2 3 0 1 2 3 in xmm1:xmm2
punpckhbw xmm2,xmm7
lea rcx,[rcx+4*r8]
// same for 4..7
movq xmm3,[rcx]
movq xmm5,[rcx+2*r8]
PUNPCKLBW xmm3,xmm5
movq xmm7,[rcx+r8]
movq xmm8,[rcx+r12]
PUNPCKLBW xmm7,xmm8
movdqa xmm4,xmm3
punpcklbw xmm3,xmm7
punpckhbw xmm4,xmm7
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
movdqa xmm5,xmm1
pextrd eax,xmm3,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm3
pextrd eax,xmm1,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm1
pextrd eax,xmm3,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm3
pextrd eax,xmm1,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
movdqa xmm5,xmm2
pextrd eax,xmm4,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm4
pextrd eax,xmm2,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm2
pextrd eax,xmm4,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm4
pextrd eax,xmm2,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
Purpose: it is really rotating images from a camera for image-vision purposes. In some (heavier) apps the rotation is postponed and done display-only (OpenGL); in some it is easier to rotate the input than to adapt the algorithms.
Updated code: I posted some final code here. The speedup was very dependent on the size of the input: large on small images, but still a factor of two on larger ones compared to loop-tiling HLL code with a 32x32 tile. (Same algorithm as the linked asm code.)
Answer
TL:DR: use punpckl/hdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?
Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.
Your code bottlenecks heavily on shuffle throughput.
Haswell only has one shuffle execution unit, on port 5. SnB/IvB have 2 integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide.
I see you already found David Kanter's excellent Haswell microarch write-up.
It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.
You have some simple missed-optimizations, even before we think about major changes:
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
should be movhps [rdx+r13],xmm6. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.
pextrd eax,xmm3,0 is always worse than movd eax, xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win. You could do a 64-bit movq and then overwrite half of that with a 32-bit pextrd. You might then bottleneck on store throughput. Also, on Sandybridge, indexed addressing modes don't stay micro-fused, so more stores would hurt your total uop throughput. But Haswell doesn't have that problem for stores, only for some indexed loads depending on the instruction.) If you use more stores in some places and more shuffles in others, you could use the extra stores for the single-register addressing modes.
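A minimal sketch of that movq-then-overwrite idea, for the first output row only (it assumes the dword naming used further down, i.e. xmm1 = [ a b c d ], xmm3 = [ i j k l ], and that [rdx] should end up holding [ a i ]):

movq   [rdx], xmm1          ; store dwords 0-1 of xmm1:  [rdx] = a b
pextrd [rdx+4], xmm3, 0     ; overwrite the upper dword: [rdx] = a i
; (movd [rdx+4], xmm3 is equivalent here since the index is 0; pextrd is needed
;  for later rows where the dword index is 1, 2 or 3)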
Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant.
This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
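A minimal sketch of that staging-buffer idea, assuming a hypothetical 16-byte-aligned 64-byte buffer addressed by rsi (the buffer and register choice are illustrative, not part of the original code):

; copy phase: eight movq loads/stores; the r8 stride is only needed here
movq xmm0, [rcx]
movq [rsi], xmm0
movq xmm0, [rcx+r8]
movq [rsi+8], xmm0
; ... the remaining six rows the same way ...
; work phase: the stride is now an assemble-time constant of 8 bytes,
; so one aligned 16-byte load fetches two rows at once
movdqa xmm1, [rsi]          ; rows 0 and 1
movdqa xmm2, [rsi+16]       ; rows 2 and 3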
Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
Extract/insert to/from integer is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.
Also consider using SSSE3 pshufb, which can put your data anywhere it needs to be, and zero other elements. This can set you up for merging with por. (You might use pshufb instead of punpcklbw.)
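A minimal sketch of the pshufb + por merging idea (register choices are illustrative, and mask_lo / mask_hi are hypothetical 16-byte constants; any mask byte with bit 7 set produces a zero in that position):

movdqa xmm5, xmm1
pshufb xmm5, [mask_lo]      ; scatter xmm1's bytes to their final positions, zero the rest
movdqa xmm6, xmm3
pshufb xmm6, [mask_hi]      ; scatter xmm3's bytes into the complementary positions
por    xmm5, xmm6           ; merge: every byte now comes from exactly one source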
Another option is to use shufps to combine data from two sources; you might need another shuffle after that. Or use punpckldq.
// "low" dwords from XMM1:xmm2
// high dwords from XMM3:xmm4
; xmm1: [ a b c d ] xmm2: [ e f g h ]
; xmm3: [ i j k l ] xmm4: [ m n o p ]
; want: [ a i b j ] / [ c k d l ] / ... I think.
;; original: replace these with
; movdqa xmm5,xmm1 ; xmm5 = [ a b c d ]
; pextrd eax,xmm3,0 ; eax = i
; pinsrd xmm5,eax,1 ; xmm5 = [ a i ... ]
; movq [rdx],xmm5
; movdqa xmm5,xmm3 ; xmm5 = [ i j k l ]
; pextrd eax,xmm1,1 ; eax = b
; pinsrd xmm5,eax,0 ; xmm5 = [ b j ... ]
; movq [rdx+r9],xmm5
Replace with:
movdqa xmm5, xmm1
punpckldq xmm5, xmm3 ; xmm5 = [ a i b j ]
movq [rdx], xmm5
movhps [rdx+r9], xmm5 ; still a pure store, doesn't cost a shuffle
So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).
Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well.
; movdqa xmm5,xmm1 ; xmm5 = [ a b c d ]
; pextrd eax,xmm3,2 ; eax = k
; pinsrd xmm5,eax,3 ; xmm5 = [ a b c k ]
; MOVHLPS xmm6,xmm5 ; xmm6 = [ c k ? ? ] (false dependency on old xmm6)
; movq [rdx+2*r9],xmm6
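A sketch of the punpckhdq replacement for that block (and for the [rdx+r13] store that follows the same pattern), assuming xmm1 and xmm3 still hold [ a b c d ] and [ i j k l ] and xmm1 is no longer needed afterwards; otherwise copy to a scratch register first:

punpckhdq xmm1, xmm3        ; xmm1 = [ c k d l ]
movq   [rdx+2*r9], xmm1     ; store [ c k ]
movhps [rdx+r13], xmm1      ; store [ d l ]; again a pure store, no shuffle uop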
Then unpack lo/hi for xmm2 and xmm4.
Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
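For example, a minimal AVX-128 sketch of the same dword interleave (VEX-encoded, three-operand, so xmm1 is left intact):

vpunpckldq xmm5, xmm1, xmm3 ; xmm5 = [ a i b j ]
vmovq      [rdx], xmm5
vmovhps    [rdx+r9], xmm5
vpunpckhdq xmm5, xmm1, xmm3 ; xmm5 = [ c k d l ]
vmovq      [rdx+2*r9], xmm5
vmovhps    [rdx+r13], xmm5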