Problem description
Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.)
Digits can be stored in memory or printed on the fly, but storing in memory and printing all at once is usually more efficient. (You can modify a loop that stores to instead print one at a time.)
Can we efficiently handle all the nibbles in parallel with SIMD? (SSE2 or later?)
Answer
related: 16-bit version that converts 1 byte to 2 hex digits which you could print or store to a buffer. And Converting bin to hex in assembly has another 16-bit version with plenty of text explanation in the half of the answer that covers the int -> hex-string part of the problem.
16 is a power of 2. Unlike decimal or other bases that aren't a power of 2, we don't need division, and we can extract the most-significant digit first (i.e. in printing order). Otherwise we can only get the least-significant digit first (and it depends on all bits of the number) and we have to go backwards: see How do I print an integer in Assembly Level Programming without printf from the c library? for non-power-of-2 bases.
Each 4-bit nibble maps to one hex digit. We can use shifts or rotates, plus an AND mask, to extract each 4-bit chunk of the input as a 4-bit integer.
(This answer converts to hex with leading zeros. If you want to drop them, bit-scan(input)/4 like lzcnt or __builtin_clz on the input, or a SIMD compare -> pmovmskb -> tzcnt on the output string, will tell you how many 0 digits you have. Or convert starting with the low nibble and work backwards, stopping when a right shift makes the value zero.)
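As a sketch of the leading-zero count idea (the function name hex_digits is mine; __builtin_clz is the GCC/Clang builtin, undefined for 0, hence the special case):

```c
#include <assert.h>

// Number of hex digits needed to print x without leading zeros.
// clz counts leading zero bits; each group of 4 is one zero digit.
static int hex_digits(unsigned x) {
    if (x == 0) return 1;               // __builtin_clz(0) is undefined
    return 8 - __builtin_clz(x) / 4;
}
```

You could then start the store/print loop at buf + (8 - hex_digits(num)) to skip the zero digits.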
Unfortunately the 0..9 and a..f hex digits are not contiguous in the ASCII character set (http://www.asciitable.com/). We either need conditional behaviour (a branch or cmov), or we can use a lookup table. A lookup table is typically the most efficient for instruction count and performance; modern CPUs have very fast L1d caches that make repeated loads of nearby bytes very cheap. Pipelined / out-of-order execution hides the ~5 cycle latency of an L1d cache load.
;; NASM syntax, i386 System V calling convention
global itohex
itohex: ; inputs: char* output, unsigned number
push edi ; save a call-preserved register for scratch space
mov edi, [esp+8] ; out pointer
mov eax, [esp+12] ; number
mov ecx, 8 ; 8 hex digits, fixed width zero-padded
.digit_loop: ; do {
rol eax, 4 ; rotate the high 4 bits to the bottom
mov edx, eax
and edx, 0x0f ; and isolate 4-bit integer in EDX
movzx edx, byte [hex_lut + edx]
mov [edi], dl ; copy a character from the lookup table
inc edi ; loop forward in the output buffer
dec ecx
jnz .digit_loop ; }while(--ecx)
pop edi
ret
section .rodata
hex_lut: db "0123456789abcdef"
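The same rotate + LUT loop in scalar C, for reference (a sketch; the name itohex_c is mine, mirroring the asm above):

```c
#include <assert.h>
#include <string.h>

// Convert a 32-bit integer to 8 fixed-width, zero-padded hex digits,
// most-significant nibble first, like the NASM .digit_loop above.
static void itohex_c(char *buf, unsigned num) {
    static const char hex_lut[16] = "0123456789abcdef";
    for (int i = 0; i < 8; i++) {
        num = (num << 4) | (num >> 28);  // rotate the high 4 bits to the bottom
        buf[i] = hex_lut[num & 0x0f];    // isolate and look up one nibble
    }
}
```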
Until BMI2 (shrx / rorx), x86 lacks a copy-and-shift instruction, so rotating in-place and then copy/AND is hard to beat. Modern x86 (Intel and AMD) has 1-cycle latency for rotates (https://agner.org/optimize/), so this loop-carried dependency chain doesn't become a bottleneck. (There are too many instructions in the loop for it to run at even 1 cycle per iteration even on 5-wide Ryzen.)
Even if we optimized by using a cmp / jb with an end pointer to enable cmp/jcc fusion on Ryzen, it's still 7 uops, more than the pipeline can handle in 1 cycle. dec/jcc macro-fusion into a single uop only happens on Intel Sandybridge-family; AMD only fuses cmp or test with jcc. I used mov ecx,8 and dec/jnz for human readability; lea ecx, [edi+8] and cmp edi, ecx / jb .digit_loop is smaller overall, and more efficient on more CPUs.
Footnote 1: We might use SWAR (SIMD within a register) to do the AND before shifting: x & 0x0f0f0f0f low nibbles, and shr(x,4) & 0x0f0f0f0f high nibbles, then effectively unroll by alternating processing a byte from each register. (Without any efficient way to do an equivalent of punpcklbw or to map integers to the non-contiguous ASCII codes, we do still just have to do each byte separately. But we might unroll the byte-extraction and read AH then AL (with movzx) to save shift instructions. Reading high-8 registers can add latency, but I think it doesn't cost extra uops on current CPUs. Writing high-8 registers is usually not good on Intel CPUs: it costs an extra merging uop to read the full register, with a front-end delay to insert it. So getting wider stores by shuffling registers is probably not good. In kernel code where you can't use XMM regs, but could use BMI2 if available, pdep could expand nibbles to bytes but this is probably worse than just masking 2 ways.)
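The SWAR split from this footnote looks like the following in C (a sketch of just the masking step, not a full conversion; swar_split is my name):

```c
#include <assert.h>
#include <stdint.h>

// Split each byte of x into its two nibbles, held in two words:
// lo gets the low nibble of each byte, hi gets the high nibble.
static void swar_split(uint32_t x, uint32_t *lo, uint32_t *hi) {
    *lo = x & 0x0f0f0f0fu;           // low nibbles in place
    *hi = (x >> 4) & 0x0f0f0f0fu;    // high nibbles shifted down
}
```

After this split, a loop could alternate taking a byte from hi then lo to produce digits in the right order for each input byte.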
Test program:
// hex.c converts argv[1] to integer and passes it to itohex
#include <stdio.h>
#include <stdlib.h>
void itohex(char buf[8], unsigned num);
int main(int argc, char**argv) {
unsigned num = strtoul(argv[1], NULL, 0); // allow any base
char buf[9] = {0};
itohex(buf, num); // writes the first 8 bytes of the buffer, leaving a 0-terminated C string
puts(buf);
}
Compile with:
nasm -felf32 -g -Fdwarf itohex.asm
gcc -g -fno-pie -no-pie -O3 -m32 hex.c itohex.o
Test runs:
$ ./a.out 12315
0000301b
$ ./a.out 12315123
00bbe9f3
$ ./a.out 999999999
3b9ac9ff
$ ./a.out 9999999999 # apparently glibc strtoul saturates on overflow
ffffffff
$ ./a.out 0x12345678 # strtoul with base=0 can parse hex input, too
12345678
Alternate implementations:
Conditional instead of lookup-table: takes several more instructions, and will probably be slower. But it doesn't need any static data.
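The conditional digit mapping, in C (a sketch; nibble_to_hex is my name — compilers typically turn the if into a cmov, matching the asm below):

```c
#include <assert.h>

// Map a 4-bit value 0..15 to '0'..'9' / 'a'..'f' without a table:
// add '0', then add the extra 'a'-('0'+10) offset for values above 9.
static char nibble_to_hex(unsigned n) {
    char c = (char)('0' + n);
    if (n > 9)
        c += 'a' - ('0' + 10);   // bridge the ASCII gap between '9' and 'a'
    return c;
}
```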
It could be done with branching instead of cmov, but that would be even slower most of the time. (It won't predict well, assuming a random mix of 0..9 and a..f digits.) https://codegolf.stackexchange.com/questions/193793/little-endian-number-to-string-conversion/193842#193842 shows a version optimized for code-size. (Other than a bswap at the start, it's a normal uint32_t -> hex with zero padding.)
Just for fun, this version starts at the end of the buffer and decrements a pointer. (And the loop condition uses a pointer-compare.) You could have it stop once EDX becomes zero, and use EDI+1 as the start of the number, if you don't want leading zeros.
Using a cmp eax,9 / ja instead of cmov is left as an exercise for the reader. A 16-bit version of this could use different registers (like maybe BX as a temporary) to still allow lea cx, [bx + 'a'-10] copy-and-add. Or just add/cmp and jcc, if you want to avoid cmov for compat with ancient CPUs that don't support P6 extensions.
;; NASM syntax, i386 System V calling convention
itohex: ; inputs: char* output, unsigned number
itohex_conditional:
push edi ; save a call-preserved register for scratch space
push ebx
mov edx, [esp+16] ; number
mov ebx, [esp+12] ; out pointer
lea edi, [ebx + 7] ; First output digit will be written at buf+7, then we count backwards
.digit_loop: ; do {
mov eax, edx
and eax, 0x0f ; isolate the low 4 bits in EAX
lea ecx, [eax + 'a'-10] ; possible a..f value
add eax, '0' ; possible 0..9 value
cmp ecx, 'a'
cmovae eax, ecx ; use the a..f value if it's in range.
; for better ILP, another scratch register would let us compare before 2x LEA,
; instead of having the compare depend on an LEA or ADD result.
mov [edi], al ; *ptr-- = c;
dec edi
shr edx, 4
cmp edi, ebx ; alternative: jnz on flags from EDX to not write leading zeros.
jae .digit_loop ; }while(ptr >= buf)
pop ebx
pop edi
ret
We could expose even more ILP within each iteration using 2x lea + cmp/cmov. cmp and both LEAs only depend on the nibble value, with cmov consuming all 3 of those results. But there's lots of ILP across iterations, with only the shr edx,4 and the pointer decrement as loop-carried dependencies. I could have saved 1 byte of code-size by arranging so I could use cmp al, 'a' or something. And/or add al,'0' if I didn't care about CPUs that rename AL separately from EAX.
Testcase that checks for off-by-1 errors by using a number that has both 9 and a in its hex digits:
$ nasm -felf32 -g -Fdwarf itohex.asm && gcc -g -fno-pie -no-pie -O3 -m32 hex.c itohex.o && ./a.out 0x19a2d0fb
19a2d0fb
SIMD with SSE2, SSSE3, AVX2 or AVX512F, and ~2 instructions with AVX512VBMI
With SSSE3 and later, it's best to use a byte shuffle as a nibble lookup table.
Most of these SIMD versions could be used with two packed 32-bit integers as input, with the low and high 8 bytes of the result vector containing separate results that you can store separately with movq and movhps. Depending on your shuffle control, this is exactly like using it for one 64-bit integer.
SSSE3 pshufb parallel lookup table. No need to mess around with loops, we can do this with a few SIMD operations, on CPUs that have pshufb. (SSSE3 is not baseline even for x86-64; it was new with Intel Core2 and AMD Bulldozer.)
pshufb is a byte shuffle that's controlled by a vector, not an immediate (unlike all earlier SSE1/SSE2/SSE3 shuffles). With a fixed destination and a variable shuffle-control, we can use it as a parallel lookup table to do 16x lookups in parallel (from a 16-entry table of bytes in a vector).
So we load the whole integer into a vector register, and unpack its nibbles to bytes with a bit-shift and punpcklbw. Then use a pshufb to map those nibbles to hex digits.
That leaves us with the ASCII digits in an XMM register, with the least significant digit as the lowest byte of the register. Since x86 is little-endian, there's no free way to store them to memory in the opposite order, with the MSB first.
We can use an extra pshufb to reorder the ASCII bytes into printing order, or use bswap on the input in an integer register (and reverse the nibble -> byte unpacking). If the integer is coming from memory, going through an integer register for bswap kinda sucks (especially for AMD Bulldozer-family), but if you have the integer in a GP register in the first place it's pretty good.
;; NASM syntax, i386 System V calling convention
section .rodata
align 16
hex_lut: db "0123456789abcdef"
low_nibble_mask: times 16 db 0x0f
reverse_8B: db 7,6,5,4,3,2,1,0, 15,14,13,12,11,10,9,8
;reverse_16B: db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
section .text
global itohex_ssse3 ; tested, works
itohex_ssse3:
mov eax, [esp+4] ; out pointer
movd xmm1, [esp+8] ; number
movdqa xmm0, xmm1
psrld xmm1, 4 ; right shift: high nibble -> low (with garbage shifted in)
punpcklbw xmm0, xmm1 ; interleave low/high nibbles of each byte into a pair of bytes
pand xmm0, [low_nibble_mask] ; zero the high 4 bits of each byte (for pshufb)
; unpacked to 8 bytes, each holding a 4-bit integer
movdqa xmm1, [hex_lut]
pshufb xmm1, xmm0 ; select bytes from the LUT based on the low nibble of each byte in xmm0
pshufb xmm1, [reverse_8B] ; printing order is MSB-first
movq [eax], xmm1 ; store 8 bytes of ASCII characters
ret
;; The same function for 64-bit integers would be identical with a movq load and a movdqu store.
;; but you'd need reverse_16B instead of reverse_8B to reverse the whole reg instead of each 8B half
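As a cross-check on the data flow, here is the same unpack / LUT / reverse sequence modeled in scalar C (a sketch, not the vector code; itohex_ssse3_model is my name):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Scalar model of the SSSE3 version: split each byte into nibbles in
// punpcklbw order (low nibble first, little-endian bytes), map through
// the 16-entry LUT (the pshufb), then reverse into MSB-first printing
// order (the reverse_8B shuffle).
static void itohex_ssse3_model(char *buf, uint32_t num) {
    static const char hex_lut[16] = "0123456789abcdef";
    uint8_t nib[8];
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(num >> (8 * i));
        nib[2*i]     = b & 0x0f;   // from xmm0 (unshifted copy)
        nib[2*i + 1] = b >> 4;     // from xmm1 (copy shifted right by 4)
    }
    for (int i = 0; i < 8; i++)    // reverse_8B: print MSB first
        buf[i] = hex_lut[nib[7 - i]];
}
```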
It's possible to pack the AND mask and the pshufb control into one 16-byte vector, similar to itohex_AVX512F below.
AND_shuffle_mask: times 8 db 0x0f ; low half: 8-byte AND mask
db 7,6,5,4,3,2,1,0 ; high half: shuffle constant that will grab the low 8 bytes in reverse order
Load it into a vector register and use it as an AND mask, then use it as a pshufb control to grab the low 8 bytes in reverse order, leaving them in the high 8. Your final result (8 ASCII hex digits) will be in the top half of an XMM register, so use movhps [eax], xmm1. On Intel CPUs, this is still only 1 fused-domain uop, so it's just as cheap as movq. But on Ryzen, it costs a shuffle on top of a store. Plus, this trick is useless if you want to convert two integers in parallel, or a 64-bit integer.
SSE2, guaranteed available in x86-64:
Without SSSE3 pshufb, we need to rely on scalar bswap to put the bytes in the right printing order, and punpcklbw the other way: to interleave with the high nibble of each pair first.
Instead of a table lookup, we simply add '0', and add another 'a' - ('0'+10) for digits greater than 9 (to put them into the 'a'..'f' range). SSE2 has a packed byte compare for greater-than, pcmpgtb. Along with a bitwise AND, that's all we need to conditionally add something.
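The per-byte logic of the compare / AND / add sequence can be modeled like this (a sketch in scalar C; hex_byte_sse2_style is my name — the real code does 8 of these at once):

```c
#include <assert.h>

// Model one lane of the SSE2 sequence: pcmpgtb produces all-ones (0xFF)
// or all-zeros per byte; pand selects the extra a..f offset; paddb adds.
static unsigned char hex_byte_sse2_style(unsigned char n) {
    unsigned char gt9 = (n > 9) ? 0xFF : 0x00;      // pcmpgtb against vec_9
    unsigned char add = gt9 & ('a' - ('0' + 10));   // pand with vec_af_add
    return (unsigned char)('0' + n + add);          // paddb twice
}
```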
itohex: ; tested, works.
global itohex_sse2
itohex_sse2:
mov edx, [esp+8] ; number
mov ecx, [esp+4] ; out pointer
;; or enter here for fastcall arg passing. Or rdi, esi for x86-64 System V. SSE2 is baseline for x86-64
bswap edx
movd xmm0, edx
movdqa xmm1, xmm0
psrld xmm1, 4 ; right shift: high nibble -> low (with garbage shifted in)
punpcklbw xmm1, xmm0 ; interleave high/low nibble of each byte into a pair of bytes
pand xmm1, [low_nibble_mask] ; zero the high 4 bits of each byte
; unpacked to 8 bytes, each holding a 4-bit integer, in printing order
movdqa xmm0, xmm1
pcmpgtb xmm1, [vec_9]
pand xmm1, [vec_af_add] ; digit>9 ? 'a'-('0'+10) : 0
paddb xmm0, [vec_ASCII_zero]
paddb xmm0, xmm1 ; conditional add for digits that were outside the 0..9 range, bringing them to 'a'..'f'
movq [ecx], xmm0 ; store 8 bytes of ASCII characters
ret
;; would work for 64-bit integers with 64-bit bswap, just using movq + movdqu instead of movd + movq
section .rodata
align 16
vec_ASCII_zero: times 16 db '0'
vec_9: times 16 db 9
vec_af_add: times 16 db 'a'-('0'+10)
; 'a' - ('0'+10) = 39 = '0'-9, so we could generate this from the other two constants, if we were loading ahead of a loop
; 'A'-('0'+10) = 7 = 0xf >> 1. So we could generate this on the fly from an AND. But there's no byte-element right shift.
low_nibble_mask: times 16 db 0x0f
This version needs more vector constants than most others. 4x 16 bytes is 64 bytes, which fits in one cache line. You might want to align 64 before the first vector instead of just align 16, so they all come from the same cache line.
This could even be implemented with only MMX, using only 8-byte constants, but then you'd need an emms, so it would probably only be a good idea on very old CPUs which don't have SSE2, or which split 128-bit operations into 64-bit halves (e.g. Pentium-M or K8). On modern CPUs with mov-elimination for vector registers (like Bulldozer and IvyBridge), it only works on XMM registers, not MMX. I did arrange the register usage so the 2nd movdqa is off the critical path, but I didn't do that for the first.
AVX can save a movdqa, but more interesting is that with AVX2 we can potentially produce 32 bytes of hex digits at a time from large inputs: 2x 64-bit integers or 4x 32-bit integers. Use a 128->256-bit broadcast load to replicate the input data into each lane. From there, an in-lane vpshufb ymm with a control vector that reads from the low or high half of each 128-bit lane should set you up with the nibbles for the low 64 bits of input unpacked in the low lane, and the nibbles for the high 64 bits of input unpacked in the high lane.
Or if the input numbers come from different sources, maybe vinserti128 the high one might be worth it on some CPUs, vs. just doing separate 128-bit operations.
AVX512VBMI (Cannonlake/IceLake, not present in Skylake-X) has a 2-register byte shuffle, vpermt2b, that could combine the punpcklbw interleaving with byte-reversing. Or even better, we have VPMULTISHIFTQB which can extract 8 unaligned 8-bit bitfields from each qword of the source.
We can use this to extract the nibbles we want into the order we want directly, avoiding a separate right-shift instruction. (It still comes with garbage bits, but vpermb ignores high garbage.)
To use this for 64-bit integers, use a broadcast source and a multishift control that unpacks the high 32 bits of the input qword in the bottom of the vector, and the low 32 bits in the top of the vector. (Assuming little-endian input)
To use this for more than 64 bits of input, use vpmovzxdq to zero-extend each input dword into a qword, setting up for vpmultishiftqb with the same 28,24,...,4,0 control pattern in each qword. (e.g. producing a zmm vector of output from a 256-bit vector of input, or four dwords -> a ymm reg to avoid clock-speed limits and other effects of actually running a 512-bit AVX512 instruction.)
Beware that wider vpermb uses 5 or 6 bits of each control byte, meaning you'll need to broadcast the hex_lut to a ymm or zmm register, or repeat it in memory.
itohex_AVX512VBMI: ; Tested with SDE
vmovq xmm1, [multishift_control]
vpmultishiftqb xmm0, xmm1, qword [esp+8]{1to2} ; number, plus 4 bytes of garbage. Or a 64-bit number
mov ecx, [esp+4] ; out pointer
;; VPERMB ignores high bits of the selector byte, unlike pshufb which zeroes if the high bit is set
;; and it takes the bytes to be shuffled as the optionally-memory operand, not the control
vpermb xmm1, xmm0, [hex_lut] ; use the low 4 bits of each byte as a selector
vmovq [ecx], xmm1 ; store 8 bytes of ASCII characters
ret
;; For 64-bit integers: vmovdqa load [multishift_control], and use a vmovdqu store.
section .rodata
align 16
hex_lut: db "0123456789abcdef"
multishift_control: db 28, 24, 20, 16, 12, 8, 4, 0
; 2nd qword only needed for 64-bit integers
db 60, 56, 52, 48, 44, 40, 36, 32
# I don't have an AVX512 CPU, so I used Intel's Software Development Emulator
$ /opt/sde-external-8.4.0-2017-05-23-lin/sde -- ./a.out 0x1235fbac
1235fbac
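The multishift control can be sanity-checked with a scalar model (a sketch; multishift_model is my name — for in-range bit offsets, each output byte is just an 8-bit field starting at that offset, so its low nibble is the hex digit and the high bits are garbage that vpermb ignores):

```c
#include <assert.h>
#include <stdint.h>

// Model VPMULTISHIFTQB with control {28,24,20,16,12,8,4,0}: each output
// byte holds 8 bits of the source qword starting at the given bit offset.
static void multishift_model(uint8_t out[8], uint64_t q) {
    static const int ctrl[8] = {28, 24, 20, 16, 12, 8, 4, 0};
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(q >> ctrl[i]);   // low nibble = one hex digit
}
```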
vpermb xmm is not lane-crossing because there's only one lane involved (unlike vpermb ymm or zmm). But unfortunately on CannonLake (according to instlatx64 results), it still has 3-cycle latency, so pshufb would be better for latency. But pshufb conditionally zeros based on the high bit, so it requires masking the control vector. That makes it worse for throughput, assuming vpermb xmm is only 1 uop. In a loop where we can keep the vector constants in registers (instead of memory operands), it only saves 1 instruction instead of 2.
(Update: yes, https://uops.info/ confirms vpermb is 1 uop with 3c latency, 1c throughput on Cannon Lake and Ice Lake. ICL has 0.5c throughput for vpshufb xmm/ymm.)
With AVX512F, we can use merge-masking to right-shift one dword while leaving the other unmodified, after broadcasting the number into an XMM register.
Or we could use an AVX2 variable-shift vpsrlvd to do exactly the same thing, with a shift-count vector of [4, 0, 0, 0]. Intel Skylake and later has single-uop vpsrlvd; Haswell/Broadwell take multiple uops (2p0 + p5). Ryzen's vpsrlvd xmm is 1 uop, 3c latency, 1 per 2 clocks throughput. (Worse than immediate shifts.)
Then we only need a single-register byte shuffle, vpshufb, to interleave nibbles and byte-reverse. But then you need a constant in a mask register, which takes a couple of instructions to create. It would be a bigger win in a loop converting multiple integers to hex.
For a non-looping stand-alone version of the function, I used two halves of one 16-byte constant for different things: set1_epi8(0x0f) in the top half, and 8 bytes of pshufb control vector in the low half. This doesn't save a lot because EVEX broadcast memory operands allow vpandd xmm0, xmm0, dword [AND_mask]{1to4}, only requiring 4 bytes of space for a constant.
itohex_AVX512F: ;; Saves a punpcklbw. tested with SDE
vpbroadcastd xmm0, [esp+8] ; number. can't use a broadcast memory operand for vpsrld because we need merge-masking into the old value
mov edx, 1<<3 ; element #3
kmovd k1, edx
vpsrld xmm0{k1}, xmm0, 4 ; top half: low dword: low nibbles unmodified (merge masking). 2nd dword: high nibbles >> 4
vmovdqa xmm2, [nibble_interleave_AND_mask]
vpand xmm0, xmm0, xmm2 ; zero the high 4 bits of each byte (for pshufb), in the top half
vpshufb xmm0, xmm0, xmm2 ; interleave nibbles from the high two dwords into the low qword of the vector
vmovdqa xmm1, [hex_lut]
vpshufb xmm1, xmm1, xmm0 ; select bytes from the LUT based on the low nibble of each byte in xmm0
mov ecx, [esp+4] ; out pointer
vmovq [ecx], xmm1 ; store 8 bytes of ASCII characters
ret
section .rodata
align 16
;hex_lut: db "0123456789abcdef"
nibble_interleave_AND_mask: db 15,11, 14,10, 13,9, 12,8 ; shuffle constant that will interleave nibbles from the high half
times 8 db 0x0f ; high half: 8-byte AND mask