performance - 将BCD打包到DPD : How to improve this amd64 assembly routine?

我正在编写一个例程，以便在BCD(每十进制数字4位)和Densely Packed Decimal (DPD)(每3个十进制数字10位)之间转换。在Mike Cowlishaw's web site上进一步记录了DPD(建议使用软件使用查找表)。

该例程只需要它使用的寄存器的低16位，但是对于较短的指令编码，我已尽可能使用32位指令。与以下代码相关的速度损失:

mov data,%eax # high 16 bit of data are cleared
...
shl %al
shr %eax

或者

and $0x888,%edi         #   = 0000 a000 e000 i000
imul $0x0490,%di        #   = aei0 0000 0000 0000

16位imul的替代方案是32位imul和后续的and或一系列lea指令和最终的and。

我的例程中的整个代码可以在下面找到。在其中是否有什么东西比我将单词和dword指令混合使用会导致性能更差？

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jz 1f

        and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x0490,%di        #   = aei0 0000 0000 0000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $13,%edi            # u = 0000 0000 0000 0aei
        imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or tab-6(,%rdi,4),%ax   # x = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
1:      ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
tab:
        .short 0x0011 ; .short 0x000a
        .short 0x0000 ; .short 0x004e
        .short 0x0081 ; .short 0x000c
        .short 0x0008 ; .short 0x002e
        .short 0x0081 ; .short 0x000e
        .short 0x0000 ; .short 0x006e
        .size tab,.-tab

改进代码

在应用了答案和注释中的一些建议以及其他一些技巧之后，这是我的改进代码。

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jnz 1f
        ret

        .align 8
1:      and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x49,%edi         #   = 0ae0 aei0 ei00 i000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $8,%edi             #   = 0000 0000 0ae0 aei0
        and $0xe,%edi           #   = 0000 0000 0000 aei0
        movzwl lookup-4(%rdi),%edx
        movzbl %dl,%edi
        imul %edi,%esi          # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or %dh,%al              #   = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
        ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
lookup:
        .byte 0x11
        .byte 0x0a
        .byte 0x00
        .byte 0x4e
        .byte 0x81
        .byte 0x0c
        .byte 0x08
        .byte 0x2e
        .byte 0x81
        .byte 0x0e
        .byte 0x00
        .byte 0x6e
        .size lookup,.-lookup

最佳答案

(我将BMI2版本分为一个单独的答案，因为它最终可能会完全不同)

在查看完您使用该imul/shr进行的操作以获取表索引之后，我可以看到在哪里可以使用BMI2 pextr替换and/imul/shr或BMI1 bextr仅替换shr(允许使用imul32而不是imul16，因为您d仅提取所需的位，而不需要从上16移零)。有些AMD CPU带有BMI1，但即使压路机也缺少BMI2。英特尔与Haswell同时推出了BMI1和BMI2。

您可以使用64位pextr一次处理两个或四个16位字。但不是整个算法:您不能进行4个并行表查找。 (在这里不值得使用AVX2 VPGATHERDD。)实际上，您可以使用pshufb来实现索引最大为4位的LUT，请参见下文。

次要改进版本:

.section .rodata
  # This won't won't assemble, written this way for humans to line up with comments.

extmask_lobits:     .long           0b0000 0111 0111 0111
extmask_hibits:     .long           0b0000 1000 1000 1000

# pext doesn't have an immediate-operand form, but it can take the mask from a memory operand.
# Load these into regs if running in a tight loop.

#### TOTALLY UNTESTED #####
.text
.p2align 4,,10
bcd2dpd_bmi2:
#       mov   %edi,%eax           #   = 0000 abcd efgh iklm
#       shl   %al                 #   = 0000 abcd fghi klm0
#       shr   %eax                #   = 0000 0abc dfgh iklm

        pext  extmask_lobits, %edi, %eax
                                #   = 0000 0abc dfgh iklm
        mov   %eax, %esi        # insn scheduling for 4-issue front-end: Fast-path is 4 fused-domain uops
          # And doesn't waste issue capacity when we're taking the slow path.  CPUs with mov-elimination won't waste execution units from issuing an extra mov
        test  $0x880, %edi        # fast path for a = e = 0
        jnz .Lslow_path
        ret

.p2align 4
.Lslow_path:
        # 8 uops, including the `ret`: can issue in 2 clocks.

        # replaces and/imul/shr
        pext  extmask_hibits, %edi, %edi #u= 0000 0000 0000 0aei
        and   $0x66, %esi                # q = 0000 0000 0fg0 0kl0
        imul  tab-8(,%rdi,4), %esi       # v = q * tab[u-2][0]
        and   $0x397, %eax               # r = 0000 00bc d00h 0klm
        xor   %esi, %eax                 # w = r ^ v
        or    tab-6(,%rdi,4), %eax       # x = w | tab[u-2][1]
        and   $0x3ff, %eax               #   = 0000 00xx xxxx xxxx
        ret

当然，如果使它成为一个内联汇编，而不是一个独立的函数，那么您将回到分支的快速路径，而慢速路径将通过。而且您也不会浪费对齐填充中间功能的空间。

对于其他功能的更多部分，使用pextr和/或pdep可能会有更大的余地。

我正在考虑如何使用BMI2做得更好。我认为我们可以从打包到64b的四个短裤中获得多个aei选择器，然后使用pdep将它们存储在不同字节的低位中。然后将movq编码到向量寄存器，在其中将其用作pshufb的随机控制掩码，以执行多个4位LUT查找。

因此，我们可以一次从60个BCD比特变为50个DPD比特。 (使用shrd在寄存器之间移动位以处理对字节可寻址存储器的加载/存储。)

实际上，使用Bdep将48个BCD位(每组4组，每组12位)-> 40个DPD位容易得多，因为您可以在64b整数寄存器中将其解压缩为4组16位。处理5个组的选择器很好，您可以解压缩pmovzx，但是处理其余数据将需要在向量寄存器中进行位改组。甚至慢速的AVX2可变位移insns都不会很容易做到这一点。 (尽管完全考虑如何使用BMI2来实现这一点很有趣，但对于仅使用SSSE3(即每个相关CPU)或SSE4.1的CPU来说，则可以大大提高速度。)

这也意味着我们可以将4组的两个簇放入128b寄存器的低半部分和高半部分中，以获得更多的并行度。

另外，48bits是一个整数字节，因此从BCD数字缓冲区中读取数据不需要任何shrd insns来将剩余的4位从最后64b转换为下一个低4位。或者，当4个被忽略的位是64b的低4或高4时，可以使用两个偏移pextr掩码工作。无论如何，我认为一次进行5组操作是不值得考虑的。

完整的BMI2/AVX pshufb LUT版本(可矢量化)

数据移动可能是:

ignored | group 3        | group 2        | group 1        |  group 0
16bits  | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm

         3   2   1 | 0
pext -> aei|aei|aei|aei  # packed together in the low bits

          2  |      1            |        0
pdep ->  ... |0000 0000 0000 0aei|0000 0000 0000 0aei  # each in a separate 16b word

movq -> xmm vector register.
 (Then pinsrq another group of 4 selectors into the upper 64b of the vector reg).  So the vector part can handle 2 (or AVX2: 4) of this at once

vpshufb xmm2 -> map each byte to another byte (IMUL table)
vpshufb xmm3 -> map each byte to another byte (OR table)


Get the bits other than `aei` from each group of 3 BCD digits unpacked from 48b to 64b, into separate 16b words:

                  group 3       | group 2             | group 1             |  group 0
pdep(src)-> 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm

 movq this into a vector reg (xmm1).  (And repeat for the next 48b and pinsrq that to the upper64)

VPAND  xmm1, mask  (to zero aei in each group)

 Then use the vector-LUT results:
VPMULLW xmm1, xmm2 -> packed 16b multiply, keeping only the low16 of the result

VPAND   xmm1,  mask
VPXOR   xmm1, something
VPOR    xmm1, xmm3

movq / pextrq back to integer regs

pext to pack the bits back together
  You don't need the AND 0x3ff or equivalent:
  Those bits go away when you pext to pack each 16b down to 10b

shrd or something to pack the 40b results of this into 64b chunks for store to memory.
  Or: 32b store, then shift and store the last 8b, but that seems lame
  Or: just do 64b stores, overlapping with the previous.  So you write 24b of garbage every time.  Take care at the very end of the buffer.

使用128b SSE指令的AVX 3操作数版本来避免需要movdqa不覆盖pshufb的表。只要您从不运行256b AVX指令，就无需弄乱vzeroupper。不过，如果使用任何矢量指令，也可以使用v(VEX)版本。在VM内部，您可能正在具有BMI2但不支持AVX的虚拟CPU上运行，因此可能是这样。检查两个CPU功能标志还是一个好主意，而不是如果看到BMI2则假定为AVX(即使对于当前存在的所有物理硬件来说都是安全的)。

这看起来真的很有效。即使您没有BMI2 pext/pdep来进行位打包/拆包，也可能值得在向量reg中进行mul/xor/等操作。我猜您可以使用现有的非BMI标量路由之类的代码来获取选择器，并进行掩码/移位/或可以将非选择器数据构建为16b块。还是shrd用于将数据从一个reg转移到另一个reg？

关于performance - 将BCD打包到DPD : How to improve this amd64 assembly routine?，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/34111959/