I can move data items stored in memory, to a general-purpose register of my choosing, using the MOV
instruction.
MOV r8, [m8]
MOV r16, [m16]
MOV r32, [m32]
MOV r64, [m64]
Now, don’t shoot me, but how is the following achieved: MOV r24, [m24]
? (I appreciate the latter is not legal).
In my example, I want to move the characters "Pip", i.e. 0x706950, to register rax
.
section .data ; Section containing initialized data
DogsName: db "PippaChips"
DogsNameLen: equ $-DogsName
I first considered that I could move the bytes separately, i.e. first a byte, then a word, or some combination thereof. However, I cannot reference the ‘top halves’ of eax
, rax
, so this falls down at the first hurdle, as I would end up over-writing whatever data was moved first.
My solution:
mov al, byte [DogsName + 2] ; move the character "p" to register al
shl rax, 16 ; shift bits left by 16, clearing ax to receive characters "Pi"
mov ax, word [DogsName] ; move the characters "Pi" to register ax
I could just declare "Pip" as an initialized data item, but the example is just that, an example; I want to understand how to reference 24 bits in assembly, or 40, 48… for that matter.
Is there an instruction more akin to MOV r24, [m24]
? Is there a way to select a range of memory addresses, as opposed to providing the offset and specifying a size operator? How do I move 3 bytes from memory to a register in x86_64 assembly?
NASM version 2.11.08; architecture: x86
Normally you'd do a 4-byte load and mask off the high garbage that came with the bytes you wanted, or simply ignore it if you're doing something with the data that doesn't care about high bits. Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?
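For instance (a quick sketch, assuming DogsName as declared above and some unrelated operand already in EBX): 2's complement operations like add only propagate carries upward, so the garbage byte can ride along and be masked once at the end, if at all:

mov eax, [DogsName]   ; loads "Pipp"; the 4th byte is garbage for our purposes
add eax, ebx          ; the low 3 bytes of the sum are unaffected by the garbage
and eax, 0x00FFFFFF   ; mask once at the end, only if a clean value is needed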
Unlike stores, loading data that you "shouldn't" is never a problem for correctness unless you cross into an unmapped page. (E.g. if db "pip"
came at the end of a page, and the following page was unmapped.) But in this case, you know it's part of a longer string, so the only possible downside is performance if a wide load extends into the next cache line (so the load crosses a cache-line boundary). Is it safe to read past the end of a buffer within the same page on x86 and x64?
Either the byte before or the byte after will always be safe to access, for any 3 bytes (without even crossing a cache-line boundary if the 3 bytes themselves weren't split between two cache lines). Figuring this out at run-time is probably not worth it, but if you know the alignment at compile time, you can do either
mov eax, [DogsName-1] ; if previous byte is in the same page/cache line
shr eax, 8
mov eax, [DogsName] ; if following byte is in the same page/cache line
and eax, 0x00FFFFFF
I'm assuming you want to zero-extend the result into eax/rax, like 32-bit operand-size, instead of merging with the existing high byte(s) of EAX/RAX like 8 or 16-bit operand-size register writes. If you do want to merge, mask the old value and OR
. Or if you loaded from [DogsName-1]
so the bytes you want are in the top 3 positions of EAX, and you want to merge into ECX: shr ecx, 24
/ shld ecx, eax, 24
to shift the old top byte down to the bottom, then shift it back while shifting in the 3 new bytes. (There's no memory-source form of shld
, unfortunately. Semi-related: efficiently loading from two separate dwords into a qword.) shld
is fast on Intel CPUs (especially Sandybridge and later: 1 uop), but not on AMD (http://agner.org/optimize/).
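As a concrete sketch of that merge (assuming the byte before DogsName is safe to read, and the old value you're merging into is in ECX):

mov  eax, [DogsName-1]   ; the 3 wanted bytes land in the top 3 bytes of EAX
shr  ecx, 24             ; old top byte of ECX down to the bottom
shld ecx, eax, 24        ; shift it back up while shifting in the 3 new bytes below it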
Combining 2 separate loads
There are many ways to do this, but there's no single fastest way across all CPUs, unfortunately. Partial-register writes behave differently on different CPUs. Your way (byte load / shift / word-load into ax
) is fairly good on CPUs other than Core2/Nehalem (which will stall while inserting a merging uop when you read eax
after assembling it). But start with movzx eax, byte [DogsName + 2]
to break the dependency on the old value of rax
.
The classic "safe everywhere" code that you'd expect a compiler to generate would be:
DEFAULT REL ; compilers use RIP-relative addressing for static data; you should too.
movzx eax, byte [DogsName + 2] ; avoid false dependency on old EAX
movzx ecx, word [DogsName]
shl eax, 16
or eax, ecx
This takes an extra instruction, but avoids writing any partial registers. However, on CPUs other than Core2 or Nehalem, the best option for 2 loads is writing ax
. (Intel P6 before Core2 can't run x86-64 code, and CPUs without partial-register renaming will merge into rax
when writing ax
). Sandybridge does still rename AX, but the merge only costs 1 uop with no stalling, i.e. the same cost as the OR. On Core2/Nehalem, though, the front-end stalls for about 3 cycles while inserting the merge uop.
Ivybridge and later only rename AH
, not AX
or AL
, so on those CPUs, the load into AX is a micro-fused load+merge. Agner Fog doesn't list an extra penalty for mov r16, m
on Silvermont or Ryzen (or any other tabs in the spreadsheet I looked at), so presumably other CPUs without partial-reg renaming also execute mov ax, [mem]
as a load+merge.
movzx eax, byte [DogsName + 2]
shl eax, 16
mov ax, word [DogsName]
; using eax:
; Sandybridge: extra 1 uop inserted to merge
; core2 / nehalem: ~3 cycle stall (unless you don't use it until after the load retires)
; everything else: no penalty
Actually, testing alignment at run-time can be done efficiently. Given a pointer in a register, the previous byte is in the same cache line unless the last 5 or 6 bits of the address are all zero (i.e. the address is aligned to the start of a cache line). Let's assume cache lines are 64 bytes; all current CPUs use that, and I don't think any x86-64 CPUs with 32-byte lines exist. (And we still definitely avoid page-crossing.)
; pointer to m24 in RSI
; result: EAX = zero_extend(m24)
test sil, 111111b ; test all 6 low bits. There's no TEST r32, imm8, so a REX + TEST r8, imm8 is shorter and never slower.
jz .aligned_by_64
mov eax, [rsi-1]
shr eax, 8
.loaded:
...
ret ; end of whatever large function this is part of
; unlikely block placed out-of-line to keep the common case fast
.aligned_by_64:
mov eax, [rsi]
and eax, 0x00FFFFFF
jmp .loaded
So in the common case, the extra cost is only one not-taken test-and-branch uop.
Depending on the CPU, the inputs, and the surrounding code, testing only the low 12 bits (i.e. only avoiding 4k page crossings) would trade some cache-line splits within pages for better branch prediction, but still never a page split. (In that case test esi, (1<<12)-1
. Unlike testing sil
with an imm8
, testing si
with an imm16
is not worth the LCP stall on Intel CPUs to save 1 byte of code. And of course if you can have your pointer in ra/b/c/dx, you don't need a REX prefix, and there's even a compact 2-byte encoding for test al, imm8
.)
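That variant could look like this (a sketch, same setup as the branchy version above; the rare .aligned_by_4k path would be handled just like .aligned_by_64):

test esi, (1<<12)-1   ; low 12 bits all zero only at the very start of a page
jz   .aligned_by_4k   ; rare case, placed out of line as before
mov  eax, [rsi-1]     ; may split a cache line, but never a page
shr  eax, 8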
You could even do this branchlessly, but clearly not worth it vs. just doing 2 separate loads!
; pointer to m24 in RSI
; result: EAX = zero_extend(m24)
xor ecx, ecx
test sil, 7 ; might as well keep it within a qword if we're not branching
setnz cl ; ecx = (not_start_of_line) ? 1 : 0
sub rsi, rcx ; normally rsi-1
mov eax, [rsi]
shl ecx, 3 ; cl = not_start_of_line ? 8 : 0 (shift count in bits)
shr eax, cl ; eax >>= 8 if we loaded from [rsi-1], else eax >>= 0
; with BMI2: shrx eax, [rsi], ecx is more efficient
and eax, 0x00FFFFFF ; mask off to handle the case where we didn't shift.
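If BMI2 is available, the same idea could look like this (just a sketch expanding on the shrx comment above; shrx takes its shift count from any register and allows a memory source, avoiding the write to CL):

xor   ecx, ecx
test  sil, 7                ; same check as above
setnz cl                    ; ecx = 1 if not at the start of a qword, else 0
sub   rsi, rcx              ; back up by one byte in the common case
shl   ecx, 3                ; shift count in bits: 8 or 0
shrx  eax, dword [rsi], ecx ; load + variable-count shift in one instruction
and   eax, 0x00FFFFFF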
True architectural 24-bit load or store
Architecturally, x86 has no 24-bit loads or stores with an integer register as the destination or source. As Brandon points out, MMX / SSE masked stores (like MASKMOVDQU
, not to be confused with pmovmskb eax, xmm0
) can store 24 bits from an MMX or XMM reg, given a vector mask with only the low 3 bytes set. But they're almost never useful because they're slow and always have an NT hint (so they write around the cache, and force eviction like movntdq
). (The AVX dword/qword masked load/store instructions don't imply NT, but aren't available with byte granularity.)
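Just to illustrate (a sketch of the MASKMOVDQU form mentioned above, not a recommendation; it assumes the destination address is in RDI, which the instruction uses implicitly, and that the 3 bytes to store are already in the bottom of XMM0):

mov        eax, 0x00808080 ; high bit set in each of the low 3 mask bytes, zero elsewhere
movd       xmm1, eax       ; byte mask in XMM1 (upper bytes zeroed by movd)
maskmovdqu xmm0, xmm1      ; NT store of the selected bytes of XMM0 to [rdi]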
AVX512BW (Skylake-server) adds vmovdqu8
which gives you byte-masking for loads and stores with fault-suppression for bytes that are masked off. (I.e. you won't segfault if the 16-byte load includes bytes in an unmapped page, as long as the mask bit isn't set for those bytes. But that does cause a big slowdown.) So microarchitecturally it's still a 16-byte load, but the effect on architectural state (i.e. everything except performance) is exactly that of a true 3-byte load/store (with the right mask).
You can use it on XMM, YMM, or ZMM registers.
;; probably slower than the integer way, especially if you don't actually want the result in a vector
mov eax, 7 ; low 3 bits set
kmovw k1, eax ; hoist the mask setup out of a loop
; load: leave out the {z} to merge into the old xmm0 (or ymm0 / zmm0)
vmovdqu8 xmm0{k1}{z}, [rsi] ; {z}ero-masked 16-byte load into xmm0 (with fault-suppression)
vmovd eax, xmm0
; store
vmovd xmm0, eax
vmovdqu8 [rsi]{k1}, xmm0 ; merge-masked 16-byte store (with fault-suppression)
This assembles with NASM 2.13.01. IDK if your NASM is new enough to support AVX512. You can play with AVX512 without hardware using Intel's Software Development Emulator (SDE).
This looks cool because it's only 2 uops to get a result into eax
(once the mask is set up). (However, http://instlatx64.atw.hu/'s spreadsheet of data from IACA for Skylake-X doesn't include vmovdqu8
with a mask, only the unmasked forms. Those do indicate that it's still a single uop load, or micro-fused store just like a regular vmovdqu/a
.)
But beware of slowdowns if a 16-byte load would have faulted or crossed a cache-line boundary. I think it internally does do the load and then discards the bytes, with a potentially-expensive special case if a fault needs to be suppressed.
Also, for the store version, beware that masked stores don't forward as efficiently to loads. (See Intel's optimization manual for more).
Footnotes:
- Wide stores are a problem because even if you replace the old value, you'd be doing a non-atomic read-modify-write, which could break things if that byte you put back was a lock, for example. Don't store outside of objects unless you know what comes next and that it's safe, e.g. padding that you put there to allow this. You could
lock cmpxchg
a modified 4-byte value into place, to make sure you're not stepping on another thread's update of the extra byte, but obviously doing 2 separate stores is much better for performance than an atomic cmpxchg
retry loop.
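A sketch of that cmpxchg idea (hypothetical setup: the 3-byte value zero-extended into EBX, the destination in RDI, and the containing dword known to be valid to read and write):

.retry:
mov  eax, [rdi]          ; expected old 4-byte value
mov  ecx, eax
and  ecx, 0xFF000000     ; keep the unrelated 4th byte as-is
or   ecx, ebx            ; splice in our 3 bytes (EBX's top byte is zero)
lock cmpxchg [rdi], ecx  ; store only if [rdi] still equals EAX
jnz  .retry              ; another thread changed a byte: rebuild and retry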