问题描述
我正在研究最小操作码大小 x86-64 strlen 的实现,该代码实现的高尔夫/二进制可执行文件的大小不应该超过某些大小(请考虑一下demoscene for简单).
总体思路来自此处,尺寸优化思路来自和此处.
输入字符串地址在rdi
中,最大长度应不大于Int32
xor eax,eax ; 2 bytes
or ecx,-1 ; 3 bytes
repne scasb ; 2 bytes
not ecx ; 2 bytes
dec ecx ; 2 bytes
最终结果在ecx
中,总计 11个字节.
问题是关于将ecx
设置为-1
选项1已经说明
or ecx,-1 ; 3 bytes
选项2
lea ecx,[rax-1] ; 3 bytes
选项3
stc ; 1 byte
sbb ecx,ecx ; 2 bytes
选项4,可能是最慢的一个
push -1 ; 2 bytes
pop rcx ; 1 byte
我了解:
选项1依赖于先前的ecx
值
选项2依赖于先前的rax
值
选项3我不确定它是否依赖于先前的ecx
值?
选项4是最慢的一个吗?
这里有明显的赢家吗?
标准是保持 opcodes的大小尽可能小,并选择最佳的一种性能.
我完全知道有一些使用现代cpu指令的实现,但是这种传统方法似乎是最小的.
对于足够好用的hacky版本,我们知道rdi
具有有效地址. edi
可能不是一个小整数,因此很可能是2个字节的mov ecx, edi
.但这并不安全,因为RDI可能会指向4GiB边界,因此很难证明它是安全的.除非您使用x32之类的ILP32 ABI,否则所有指针都在4GiB标记之下.
因此,您可能需要使用push rdi/pop rcx复制完整的RDI,每个复制1个字节.但这增加了短字符串启动的额外延迟.如果没有长度大于其起始地址的字符串,则应该是安全的. (但这对于在.data,.bss或.rodata中的静态存储来说是合理的,如果您有任何巨大的数组;例如,Linux non-PIE可执行文件的加载频率约为0x401000
= 1<< ; 22.)
如果您只希望rdi指向终止的0
字节,而不是实际需要计数,那么这很好.或者,如果您在另一个寄存器中有起始指针,则可以执行sub edi, edx
或类似的操作并以这种方式获取长度,而不用处理rcx
结果. (如果您知道结果适合32位,则不需要sub rdi, rdx
,因为您知道该位的高位无论如何都为零.而且高输入位不会影响加/减的低输出位;进位传播从左到右.)
对于已知小于255个字节的字符串,可以使用mov cl, -1
(2个字节).这使得rcx
至少为0xFF,并且更高,具体取决于其中保留了多少高位垃圾. (这在Nehalem上有部分reg停顿,并且在读取RCX时更早,否则仅依赖旧的RCX).无论如何,然后按mov al, -2
/sub al, cl
可以得到8位整数的长度.这可能有用,也可能无效.
根据调用方的不同,rcx
可能已经保存了一个指针值,在这种情况下,如果可以使用指针减法,则可以使其保持不变.
在您提出的选项中
lea ecx,[rax-1]
非常好,因为您只需对eax
进行零位置零,它是一种便宜的1 uop指令,具有1个周期的延迟,并且可以在所有主流CPU的多个执行端口上运行.
如果已经有了另一个具有已知常数值的寄存器,尤其是将其异或为零的寄存器,则3字节lea
几乎总是创建常数的最有效的3字节方式(如果可行). (请参阅将CPU寄存器中的所有位有效地设置为1 ).
是的,repne scasb
非常紧凑.在典型的Intel CPU上,其启动开销可能约为15个周期,并且根据 Agner Fog ,它会发出> = 6n uop,吞吐量为> = 2n个周期,其中n
是计数(即,它与每个字节进行2个周期进行长时间比较,而隐藏了启动开销),因此它使lea
的成本相形见war.
对ecx
的错误依赖关系可能会延迟其启动,因此您绝对希望lea
.
repne scasb
可能对您正在执行的任何操作都足够快,但是它比pcmpeqb
/pmovmsbk
/cmp
慢.对于较短的固定长度字符串,假设长度为4或8个字节(包括结尾0),则整数cmp
/jne
非常很好.安全地阅读字符串,即您不必担心页面结尾处的""
.但是,此方法的开销随字符串长度而缩放.例如,对于字符串length = 7,您可以执行4、2和1个操作数大小,或者可以执行两个重叠1个字节的dword比较.像cmp dword [rdi], first_4_bytes / jne
; cmp dword [rdi+3], last_4_bytes / jne
.
有关LEA的详细信息
在Sandybridge系列CPU上,lea
可以在与xor
相同的周期中分派给执行单元,并且xor
-零被发布到乱序的CPU内核中. xor
-清零是在发布/重命名阶段处理的,因此uop会以已执行"状态进入ROB.该指令永远不必等待RAX. (除非xor和lea
之间发生中断,但即使那样,我仍然认为在还原RAX之后并且在lea
执行之前会有一个序列化指令,因此它不会卡在等待状态.) >
简单的lea
可以在SnB的port0或port1或Skylake的port1/port5上运行(每个时钟吞吐量2个,但有时在不同的SnB系列CPU上可以使用不同的端口).这是1个周期的延迟,因此很难做得更好.
使用mov ecx, -1
(5字节)(可以在任何ALU端口上运行)不会带来任何提速.
在AMD Ryzen上,将64位模式下的lea r32, [m]
视为只能在2个端口上运行且延迟为2c(而不是1)的慢速" LEA.更糟糕的是,Ryzen无法消除Xor归零
您进行的微基准测试仅测量无误依赖性和延迟的版本的吞吐量.这通常是一种有用的措施,您确实得到了正确的答案,即lea
是最佳选择.
纯吞吐量是否可以准确反映您的实际用例,这是另一回事.如果字符串比较位于长路径或循环传输的数据依赖关系链的一部分中,而没有被jcc
破坏,则您实际上可能依赖于延迟而不是吞吐量,而不是通过jcc
进行中断,从而为您提供分支预测和推测性执行. (但是无分支代码通常更大,所以这不太可能.)
stc
/sbb ecx,ecx
很有趣,但是只有AMD CPU将sbb
视为依赖项破坏(仅取决于CF,而不取决于整数寄存器).在Intel Haswell及更早版本上,sbb
是2 uop指令(因为它具有3个输入:2个GP整数+标志).它具有2c的延迟,这就是为什么它表现如此差的原因. (等待时间是循环承载的dep链.)
缩短序列的其他部分
根据您正在执行的操作,也许也可以使用strlen+2
,但是可以抵消另一个常量或其他东西. dec ecx
在32位代码中只有1个字节,但是x86-64没有简短的inc/dec
指令.因此,/dec在64位代码中不那么出色.
在repne scas
之后,您拥有ecx = -len - 2
(如果您以ecx = -1
开头),而not
为您提供-x-1
(即+len + 2 - 1
).
; eax = 0
; ecx = -1
repne scasb ; ecx = -len - 2
sub eax, ecx ; eax = +len + 2
I'm investigating a minimal opcode size x86-64 strlen implementation for my code golfing / binary executable that is not supposed to exceed some size (think of demoscene for simplicity).
General idea comes from here, size optimization ideas from here and here.
Input string address is in rdi
, max length should be not bigger than Int32
xor eax,eax ; 2 bytes
or ecx,-1 ; 3 bytes
repne scasb ; 2 bytes
not ecx ; 2 bytes
dec ecx ; 2 bytes
Final result is in ecx
in 11 bytes total.
The question is about setting ecx
to -1
Option 1 already stated
or ecx,-1 ; 3 bytes
Option 2
lea ecx,[rax-1] ; 3 bytes
Option 3
stc ; 1 byte
sbb ecx,ecx ; 2 bytes
Option 4 , probably the slowest one
push -1 ; 2 bytes
pop rcx ; 1 byte
I understand that:
Option 1 has dependency on previous ecx
value
Option 2 has dependency on previous rax
value
Option 3 I'm not sure if it has dependency on previous ecx
value?
Option 4 is the slowest one?
Is there a clear winner here?
The criteria is to keep the opcodes size as small as possible and choose the best one performance wise.
I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.
For a hacky good-enough version, we know rdi
has a valid address. It's very likely that edi
is not a small integer, thus 2 byte mov ecx, edi
. But this isn't safe is RDI could be pointing just past a 4GiB boundary so it's hard to prove it's safe. Unless you're using an ILP32 ABI like x32 so all pointers are below the 4GiB mark.
So you might need to copy the full RDI with push rdi / pop rcx, 1 byte each. But that adds extra latency for short-string startup. It should be safe if you have no strings with length higher than their starting address. (But that is plausible for static storage in .data, .bss, or .rodata if you have any huge arrays; for example Linux non-PIE executables are loaded at around 0x401000
= 1<<22.)
This is great if you just want rdi pointing to the terminating 0
byte, instead of actually needing a count. Or if you have the start pointer in another register so you can do sub edi, edx
or something and get the length that way instead of processing the rcx
result. (If you know the result fits in 32 bits, you don't need sub rdi, rdx
because you know the upper bits of that would be zero anyway. And high input bits don't affect low output bits for add/sub; carry propagates left to right.)
For strings known to be under 255 bytes, you can use mov cl, -1
(2 bytes). That makes rcx
at least 0xFF, and higher depending on what high garbage was left in it. (This has a partial-reg stall on Nehalem and earlier when RCX is read, otherwise just a dependency on the old RCX). Anyway, then mov al, -2
/ sub al, cl
to get the length as an 8-bit integer. This may or may not be useful.
Depending on the caller, rcx
might have already been holding a pointer value, in which case you could leave it untouched if you can use pointer subtraction.
Out of the options you proposed
lea ecx,[rax-1]
is very good because you just xor-zeroed eax
, and it's a cheap 1 uop instruction with 1 cycle latency and can run on multiple execution ports on all mainstream CPUs.
When you already have another register with a known constant value, especially one that's xor-zeroed, 3-byte lea
almost always the most efficient 3-byte way to create a constant, if it works. (See Set all bits in CPU register to 1 efficiently).
Yes, repne scasb
is very compact. Its startup overhead is maybe something like 15 cycles on a typical Intel CPU, and according to Agner Fog, it issues >=6n uops with a throughput of >= 2n cycles, where n
is the count (i.e. 2 cycles per byte it compares for long compares, where the startup overhead is hidden), so it dwarfs the cost of lea
.
Something with a false dependency on ecx
could delay its startup, so you definitely want lea
.
repne scasb
is probably fast enough for whatever you're doing, but it's slower than pcmpeqb
/ pmovmsbk
/ cmp
. For short fixed-length strings, integer cmp
/ jne
is very good when the length is 4 or 8 bytes (including the terminating 0), assuming you can safely over-read your strings, i.e. you don't have to worry about ""
at the end of a page. This method has overhead that scales with string length, though. For string length=7, for example, you could do 4, 2, and 1 operand sizes, or you could do two dword compares overlapping by 1 byte. like cmp dword [rdi], first_4_bytes / jne
; cmp dword [rdi+3], last_4_bytes / jne
.
More details about LEA
On a Sandybridge-family CPU, the lea
could be dispatched to an execution unit in the same cycle as it and the xor
-zero were issued into the out-of-order CPU core. xor
-zeroing is handled in the issue/rename stage, so the uop enters the ROB in an "already-executed" state. It's impossible for the instruction to ever have to wait for RAX. (Unless an interrupt happens between the xor and the lea
, but even then I think there'd be a serializing instruction after restoring RAX and before the lea
could execute, so it couldn't be stuck waiting.)
Simple lea
can run on port0 or port1 on SnB, or port1 / port5 on Skylake (2 per clock throughput, but sometimes different ports on different SnB-family CPUs). It's 1 cycle latency, so it's hard to do much better.
It's unlikely you'd see any speedup from using mov ecx, -1
(5 bytes) which can run on any ALU port.
On an AMD Ryzen, lea r32, [m]
in 64-bit mode is treated as a "slow" LEA that can only run on 2 ports, and has 2c latency instead of 1. Worse, Ryzen doesn't eliminate xor-zeroing.
The microbenchmark test you did only measures throughput for the versions with no false dependencies, not latency. This is often a useful measure, and you did happen to get the right answer that lea
is the best choice.
Whether pure throughput accurately reflects anything about your real use case is another matter. You might actually depend on the latency, not throughput, if the string compare is on the critical path as part of a long or loop-carried data dependency chain not broken by a jcc
to give you branch-prediction + speculative execution. (But branchless code is often larger, so this is unlikely).
stc
/ sbb ecx,ecx
is interesting, but only AMD CPUs treat sbb
as dependency-breaking (only depending on CF, not the integer register). On Intel Haswell and earlier, sbb
is a 2 uop instruction (because it has 3 inputs: 2 GP integer + flags). It has 2c latency, which is why it performs so badly. (The latency is a loop-carried dep chain.)
Shortening other parts of the sequence
Depending what you're doing, you may be able to use strlen+2
just as well, but offsetting another constant or something. dec ecx
is only 1 byte in 32-bit code, but x86-64 doesn't have the short-form inc/dec
instructions. So not / dec isn't as cool in 64-bit code.
After repne scas
, you have ecx = -len - 2
(if you started with ecx = -1
), and not
gives you -x-1
(i.e. +len + 2 - 1
).
; eax = 0
; ecx = -1
repne scasb ; ecx = -len - 2
sub eax, ecx ; eax = +len + 2
这篇关于最小操作码大小x86-64 strlen实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!