Problem description
Chapter 3 of Computer Systems: A Programmer's Perspective (2nd Edition) mentions that cltq is equivalent to movslq %eax, %rax.
Why did they create a new instruction (cltq) instead of just using movslq %eax,%rax? Isn't that redundant?
Recommended answer
TL;DR: use cltq when possible, because it's one byte shorter than the exactly-equivalent movslq %eax, %rax. That's a very minor advantage (so don't sacrifice anything else to make this happen) but choose eax if you're going to want to sign-extend it a lot.
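The one-byte difference can be checked by writing out both encodings by hand. This is a minimal sketch, assuming the usual encodings (REX.W + 98 for cltq/CDQE, REX.W + 63 /r for movslq/MOVSXD); the variable names are made up for illustration.

```python
# Hand-assembled encodings; the names are illustrative only.
REX_W = 0x48  # REX prefix with W=1 (64-bit operand size)

# cltq / CDQE: REX.W + 98, no ModRM byte needed (operands are implicit)
cltq = bytes([REX_W, 0x98])

# movslq %eax, %rax / MOVSXD rax, eax: REX.W + 63 /r
# ModRM for register-direct operands: mod=11, reg=rax (0), rm=eax (0)
modrm = (0b11 << 6) | (0 << 3) | 0
movslq_eax_rax = bytes([REX_W, 0x63, modrm])

print(cltq.hex(), len(cltq))                      # 4898 2
print(movslq_eax_rax.hex(), len(movslq_eax_rax))  # 4863c0 3
```

The saving comes entirely from cltq not needing a ModRM byte, since its source and destination are fixed.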
This is mostly relevant for compiler-writers (compiling signed-integer loop counters indexing arrays); stuff like sign-extending a loop counter every iteration only happens when compilers don't manage to take advantage of signed overflow being undefined behaviour to avoid it. Human programmers will just decide what's signed vs. unsigned to save instructions.
Related: complete run-down on Intel vs. AT&T mnemonics for the different sizes of the instructions that sign-extend within RAX (cltq), or from EAX into EDX:EAX (cltd), with the equivalent movsx / movs?t?: What does cltq do in assembly?
Actually, the 32->64 bit form of MOVSX (called movslq in AT&T syntax) is the new one, new with AMD64. The Intel-syntax mnemonic is actually MOVSXD. The opcode is 63 /r (so it's 3 bytes including the necessary REX prefix, vs. 4 bytes for 8->64 or 16->64 MOVSX). AMD repurposed the opcode from ARPL, which doesn't exist in 64-bit mode.
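To make the 3-byte vs. 4-byte comparison concrete, here is a rough sketch of an encoder for the register-to-register forms only. The function name and register numbering scheme are hypothetical, and it deliberately ignores r8-r15 (which would need the extra REX.R/REX.B bits) and memory operands.

```python
REX_W = 0x48  # REX prefix with W=1: 64-bit destination

# Opcode bytes after the REX prefix, keyed by source width in bits.
OPCODES = {
    8: b"\x0f\xbe",   # MOVSX r64, r/m8   (movsbq)
    16: b"\x0f\xbf",  # MOVSX r64, r/m16  (movswq)
    32: b"\x63",      # MOVSXD r64, r/m32 (movslq), the repurposed ARPL opcode
}

def encode_sign_extend_to_64(src_bits, dst, src):
    """Encode a reg-to-reg sign extension into a 64-bit register.

    dst and src are register numbers 0-7 (0 = rax/eax/ax/al, 1 = rcx..., 2 = rdx...).
    Hypothetical helper for illustration, not a real assembler API.
    """
    if not (0 <= dst <= 7 and 0 <= src <= 7):
        raise ValueError("this sketch only handles the low 8 registers")
    modrm = 0xC0 | (dst << 3) | src  # mod=11, reg=destination, rm=source
    return bytes([REX_W]) + OPCODES[src_bits] + bytes([modrm])

for bits in (8, 16, 32):
    code = encode_sign_extend_to_64(bits, 0, 0)
    print(bits, code.hex(), len(code))
```

With rax/eax as both operands this gives 48 0f be c0, 48 0f bf c0 and 48 63 c0 respectively.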
To understand the history, remember that current x86 wasn't designed all at once. First there was 16-bit 8086, with no MOVSX/MOVZX at all, just CBW and CWD. Then 386 added MOVS/ZX (and wider versions of CBW/CWD for sign-extending within eax or into edx). Then AMD extended all of that to 64-bit.
The REX versions of the existing MOVSX opcodes still have an 8 or 16bit source, but sign extend all the way to 64 bits instead of just 32. The operand-size prefix lets you encode movsbw, aka movsx r16, r/m8. IDK what happens if you use an operand-size prefix and REX.W at the same time. Or what happens if you use an operand-size prefix with the 16bit source form of MOVSX. Probably it's just an expensive way to encode MOV, like using 63 /r without a REX prefix (which Intel's insn set manual recommends against).
cltq (aka CDQE) is just the obvious way to extend the existing cwtl (aka CWDE) with a REX.W prefix to promote the operand-size to 64 bits. The original form of this, cbtw (aka CBW), was in 8086, predating MOVSX, and was the only sane way to sign-extend anything. Since shifts with immediate count>1 were a 186 feature, the least bad other option seems to be mov ah, al / mov cl, 7 / sar ah, cl to broadcast the sign bit to all positions.
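A small simulation can confirm that the three-instruction 8086 fallback produces the same ax as cbtw. This is only a model of the register arithmetic in Python, with made-up function names, not real machine code.

```python
def cbw(al):
    """ax after cbtw/CBW: al sign-extended into ah."""
    ah = 0xFF if al & 0x80 else 0x00
    return (ah << 8) | al

def sar8(value, count):
    """Arithmetic right shift of an 8-bit register value."""
    signed = value - 0x100 if value & 0x80 else value  # reinterpret as signed
    return (signed >> count) & 0xFF                    # Python's >> keeps the sign

def cbw_the_hard_way(al):
    """ax after: mov ah, al / mov cl, 7 / sar ah, cl."""
    ah = al             # mov ah, al
    cl = 7              # mov cl, 7
    ah = sar8(ah, cl)   # sar ah, cl  -> every bit of ah is al's sign bit
    return (ah << 8) | al

# Exhaustively compare both versions over every possible al.
print(all(cbw(al) == cbw_the_hard_way(al) for al in range(256)))
```

Shifting right by 7 leaves only copies of the original bit 7, which is exactly the value CBW writes into ah.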
Also, don't confuse cwtl with cwtd (aka CWD: sign extend ax into dx:ax, e.g. to set up for idiv).
The AT&T mnemonics are pretty horrible here. l vs. d, really? The Intel mnemonics all have e on the end for the ones that extend within rax, and not for the ones that extend into (part of) rdx. Except for CBW, but of course that extends al into ax, because even 8086 had 16bit registers, so it never needed to store 16bit values in dl:al. (idiv r/m8 uses ax as a source reg, not dl:al, and puts the results in ah, al.)
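For reference, the six mnemonics discussed here can be laid out as a small table and the naming pattern checked mechanically. The table layout and variable names are just one way to organize it.

```python
# (AT&T mnemonic, Intel mnemonic, source, destination)
SIGN_EXTEND_MNEMONICS = [
    ("cbtw", "CBW",  "al",  "ax"),
    ("cwtl", "CWDE", "ax",  "eax"),
    ("cltq", "CDQE", "eax", "rax"),
    ("cwtd", "CWD",  "ax",  "dx:ax"),
    ("cltd", "CDQ",  "eax", "edx:eax"),
    ("cqto", "CQO",  "rax", "rdx:rax"),
]

# Rows whose destination is a single register extend within rax;
# rows with a register pair extend into (part of) rdx.
within_rax = [row for row in SIGN_EXTEND_MNEMONICS if ":" not in row[3]]
into_rdx = [row for row in SIGN_EXTEND_MNEMONICS if ":" in row[3]]

# Intel names that break the "ends in E means within rax" pattern.
exceptions = [intel for _, intel, _, _ in within_rax if not intel.endswith("E")]
print(exceptions)  # ['CBW']
```

Note how cwtl and cwtd differ only in their last letter in AT&T syntax, even though one writes eax and the other writes dx.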
Yes, this is one of many redundancies in x86 assembly language, e.g. sub eax,eax to zero rax vs. xor eax,eax. (mov eax,0 isn't totally redundant, because it doesn't affect flags. If you include slight differences like that as redundant, or even instructions that run on different execution ports, there are lots of ways to do some things.)
If I had the chance to modify the x86-64 ISA, I would probably give MOVZX and MOVSX single-byte opcodes (instead of 0F XX two-byte escaped opcodes), at least the 8-bit-source versions. So movsx eax, byte [mem] would be as compact as mov al, [mem]. (They're already the same performance on Intel CPUs: handled entirely in the load port, with no ALU uop.) Most real code fails to take advantage of [u]int16_t arrays for higher cache density, so I think movs/zx from word to dword or qword is rarer. Or maybe there's enough wide-character code around to justify shorter opcodes for MOVZX r32/r64, r/m16. To make some room, we can drop the CBW / CWDE / CDQE opcode entirely. I might keep CWD / CDQ / CQO as a useful setup for idiv, which has no one-instruction equivalent.
In reality, probably having fewer single-byte opcodes and more escape prefixes would be a lot more useful (e.g. so common SSE2 insns can be 2 opcode bytes + ModRM, instead of the usual 3 or 4 opcode bytes). Instruction-decoding is less of a bottleneck with shorter instructions in high-performance loops. But if x86-64 machine code is too different from 32-bit, we need extra decode transistors. That may be ok now that power limitations have made dark silicon a thing, because a core would never need its 32-bit decoder powered on at the same time as its 64-bit decoder. That wasn't the case when AMD was designing AMD64. (err, HyperThreading alternating cycles between logical threads running in 32-bit and 64-bit would stop you from fully shutting down either, if they were separate.)
Instead of CDQ, we could have made two-operand shift instructions with a separate destination (so the source isn't destroyed), so sar edx, eax, 31 would do CDQ in 3 bytes. Dropping the one-byte xchg-with-eax opcodes (other than 0x90 xchg eax,eax NOP) would free up lots of coding space for sar, shr, shl without needing the Reg field of the ModRM as extra opcode bits. And of course remove the doesn't-affect-flags special case for shift_count=0 to kill the input dependency on FLAGS.
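The claim that a non-destructive sar by 31 would reproduce CDQ can be sanity-checked with a quick model. Keep in mind that sar edx, eax, 31 is a hypothetical instruction that does not exist in x86; the function names below are invented for the sketch.

```python
MASK32 = 0xFFFFFFFF

def cdq_edx(eax):
    """edx after cltd/CDQ: all bits are copies of eax's sign bit."""
    return MASK32 if eax & 0x80000000 else 0

def sar32(value, count):
    """Arithmetic right shift of a 32-bit register value."""
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> count) & MASK32

def hypothetical_sar_edx_eax_31(eax):
    """Model of the imagined 'sar edx, eax, 31': edx = eax >> 31, eax untouched."""
    return sar32(eax, 31)

samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xDEADBEEF, MASK32]
print(all(cdq_edx(x) == hypothetical_sar_edx_eax_31(x) for x in samples))
```

Both produce 0 for non-negative eax and 0xFFFFFFFF for negative eax, which is the high half idiv expects in edx.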
(I'd also have changed setcc r/m8 to setcc r/m32. Or maybe setcc r32/m8. (Memory dst uses a separate ALU uop anyway, so it could decode as setcc tmp32 and store the low 8 of that.) It's almost always used by xor-zeroing a destination, and you have to juggle that vs. the flag-setting.)
AMD had the chance to do (some of) this with AMD64, but chose to be conservative to share as many instruction-decode transistors as possible. (Can't really fault them for that, but it's unfortunate that political/economic circumstances resulted in x86 missing its only chance for the foreseeable future to drop some of its legacy baggage.) It also meant less work modifying code generation / analysis software, but that's a one-time cost and small potatoes compared to potentially making every x86-64 CPU run faster and have smaller binaries.
See also the x86 tag wiki for more links, including this old appendix from the NASM manual documenting when every form of every instruction was introduced.
Related: MOVZX missing 32 bit register to 64 bit register.