问题描述
我要在汇编x86 64位中实现一个功能,而我无法更改该功能:
I am implementing a function in assembly x86 64 bits, which I am unable to alter:
unsigned long rotate(unsigned long val, unsigned long num, unsigned long direction);
方向-左侧为1,右侧为0.
direction- 1 is left and 0 is right.
这是我的向左移动的代码,但是最后一关没有用.有谁可以帮助我吗.
This is my code to shift left but its not working the last bit is off. Can someone help me please.
rotate:
push rbp
push rdi
push rsi
push rdx
mov rbp, rsp
sub rsp, 16
cmp rdx, 1
je shift_left
shift_left:
mov rax, rdi
shl rax, cl
mov rax, rax
mov rcx, rdi
sub cl, 64
shl rcx, cl
or rax rdx
mov rax, rax
add rsp, 16
#I pop all the registers used and ret
推荐答案
x86具有旋转指令.使用 rol rax,cl
进行旋转向左旋转,然后 ror rax,cl
向右旋转.
x86 has rotate instructions. Use rol rax, cl
to rotate left, and ror rax, cl
to rotate right.
似乎您没有意识到 cl
是 rcx
/ ecx
的低字节.因此, shl rcx,cl
正在移动移位计数.您的功能过于复杂,但是在学习时这是正常的.需要实践才能找到可以用很少的说明即可实现的简单的潜在问题.
It seems you didn't realize that cl
is the low byte of rcx
/ ecx
. Thus shl rcx, cl
is shifting the shift-count. Your function is over-complicated, but that's normal when you're just learning. It takes practice to find the simple underlying problem that you can implement in few instructions.
此外,我认为 mov rcx,rdi
应该是 mov rcx,rsi
.IDK mov rax,rax
应该是什么;它只是一个无人操作.
Also, I think mov rcx, rdi
was supposed to be mov rcx, rsi
. IDK what mov rax,rax
was supposed to be; its just a no-op.
调用左旋转和右旋转的不同函数会更加有效,除非您实际上需要 direction
作为运行时变量,而不仅仅是运行时常量1或0.
It would be significantly more efficient to call different functions for rotate-left vs. rotate-right, unless you actually need direction
to be a runtime variable that isn't just a build-time constant 1 or 0.
或者使其无分支,有条件地执行 cl = 64-cl
,因为 n
的左旋转与的右旋转相同> 64-n
.并且由于旋转指令掩盖了计数(并且旋转仍然是模块化的),因此您实际上可以只执行 -n
而不是 64-n
.(请参阅 C ++中循环移位(旋转)操作的最佳做法,用于某些使用 -n
而不是 32-n
的C,并编译为单个旋转指令.
Or to make it branchless, conditionally do cl = 64-cl
, because a left-rotate by n
is the same thing as a right-rotate by 64-n
. And because rotate instructions mask the count (and rotate is modular anyway), you can actually just do -n
instead of 64-n
. (See Best practices for circular shift (rotate) operations in C++ for some C that uses -n
instead of 32-n
, and compiles to a single rotate instruction).
TL:DR由于旋转对称,因此可以通过反计数来向另一个方向旋转.正如 @njuffa指出的,您可以使用带符号的班次计数来编写该函数,其中负数表示以另一种方式旋转,因此调用方将在其中传递 num
或 -num
第一名.
TL:DR Because of rotate symmetry, you can rotate in the other direction just by negating the count. As @njuffa points out, you could have written the function with a signed shift count where negative means rotate the other way, so the caller would pass you num
or -num
in the first place.
请注意,在您的代码中, sub cl,64
对下一个 shl
的移位计数没有影响,因为64位 shl
已经用 cl&掩盖了计数63
.
Note that in your code, sub cl, 64
has no effect on the shift count of the next shl
, because 64-bit shl
already masks the count with cl & 63
.
我制作了一个C版本,以查看编译器会做什么(在 Godbolt编译器浏览器).gcc有一个有趣的想法:双向旋转并使用 cmov
选择正确的结果.之所以这么糟,是因为在Intel SnB系列CPU上,可变计数的移位/旋转为3 oups.(因为如果计数结果为 0
,则它们必须保持标记不变.请参见,它也适用于旋转.)
I made a C version to see what compilers would do (on the Godbolt compiler explorer). gcc has an interesting idea: rotate both ways and use a cmov
to pick the right result. This kinda sucks because variable-count shifts/rotates are 3 uops on Intel SnB-family CPUs. (Because they have to leave the flags unmodified if the count turns out to be 0
. See the shift section of this answer, all of it applies to rotates as well.)
不幸的是,BMI2仅添加了立即计数版本的 rorx
和可变计数 shlx
/ shrx
,而不是可变计数no-标志旋转.
Unfortunately BMI2 only added an immediate-count version of rorx
, and variable-count shlx
/shrx
, not variable-count no-flags rotate.
无论如何,基于这些想法,这是实现x86-64 System V ABI/调用约定(其中允许函数破坏input-arg寄存器和 r10
/ r11
).我假设您使用的是x86-64 SysV ABI(例如Linux或OS X)平台,因为您似乎正在使用 rdi
, rsi
和 rdx
表示前3个参数(或至少尝试这样做),而您的 long
为64位.
Anyway, based on those ideas, here's a good way to implement your function for the x86-64 System V ABI / calling convention (where functions are allowed to clobber the input-arg registers and r10
/ r11
). I assume you're on a platform that uses the x86-64 SysV ABI (like Linux or OS X) because you appear to be using rdi
, rsi
, and rdx
for the first 3 args (or at least trying to), and your long
is 64 bits.
;; untested
;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
xor ecx, ecx
sub ecx, esi ; -num
test edx, edx
mov rax, rdi ; put val in the retval register
cmovnz ecx, esi ; cl = direction ? num : -num
rol rax, cl ; works as a rotate-right by 64-num if direction is 0
ret
xor-zero/sub通常比mov/neg好,因为xor-zeroing不在关键路径上.不过,在Ryzen上, mov
/ neg
更好,它具有零延迟整数 mov
,并且仍需要ALU uop来进行异或归零.但是,如果ALU uops不是您的瓶颈,那还是可以的.这在Intel Sandybridge上是一个明显的胜利(其中 xor
-zeroing与NOP一样便宜),并且在没有零延迟 mov
(例如Silvermont/KNL或AMD Bulldozer系列).
xor-zero / sub is often better than mov / neg because the xor-zeroing is off the critical path. mov
/ neg
is better on Ryzen, though, which has zero-latency integer mov
and still needs an ALU uop to do xor-zeroing. But if ALU uops aren't your bottleneck, this is still fine. It's a clear win on Intel Sandybridge (where xor
-zeroing is as cheap as a NOP), and also a latency win on other CPUs that don't have zero-latency mov
(like Silvermont/KNL, or AMD Bulldozer-family).
cmov
在Intel百老汇之前是2 oups.如果不是更好的话,用2的补码bithack替代xor/sub/test/cmov可能一样好. -num =〜num + 1
.
cmov
is 2 uops on Intel pre-Broadwell. A 2's complement bithack alternative to xor/sub/test/cmov might be just as good if not better. -num = ~num + 1
.
rotate:
dec edx ; convert direction = 0 / 1 into -1 / 0
mov ecx, esi ; couldn't figure out how to avoid this with lea ecx, [rdx-1] or something
xor ecx, edx ; (direction==0) ? ~num : num ; NOT = xor with all-ones
sub ecx, edx ; (direction==0) ? ~num + 1 : num + 0;
; conditional negation using -num = ~num + 1. (subtracting -1 is the same as adding 1)
mov rax, rdi ; put val in the retval register
rol rax, cl ; works as a rotate-right by 64-num if direction is 0
ret
如果内联,则将具有更多优势,因此 num
可能已经在 ecx
中,这使其比其他选项(在代码大小和uop计数方面)更短
This would have more of an advantage if inlined so num
could already be in ecx
, making this shorter than the other options (in code-size and uop count).
Haswell上的延迟
Latency on Haswell
- 从准备好
方向
到cl
进行rol
的过程:3个周期(dec
/xor
/sub
).与其他版本中的test
/cmov
相同.(但是在Broadwell/Skylaketest
/cmov
上,从direction
到cl
仅具有2个周期的延迟) - 从准备好
num
到准备好cl
:2个周期:mov
(0)+xor
(1)+sub
(1),因此num
仍有空间在1个周期后准备就绪.这比Haswell上的cmov
更好,后者的sub
(1)+cmov
(2)= 3个周期.但是在Broadwell/Skylake上,任何一种方式都只有2c.
- From
direction
being ready tocl
being ready forrol
: 3 cycles (dec
/xor
/sub
). Same astest
/cmov
in the other version. (But on Broadwell/Skylaketest
/cmov
only has 2 cycle latency fromdirection
tocl
) - From
num
being ready tocl
being ready: 2 cycles:mov
(0) +xor
(1) +sub
(1), so there's room fornum
to be ready 1 cycle later. This is better than withcmov
on Haswell where it'ssub
(1) +cmov
(2) = 3 cycles. But on Broadwell/Skylake, it's only 2c either way.
在Broadway之前,总的前端uop数量更好,因为我们避免使用 cmov
.我们用 xor
-zeroing换了 mov
,这在Sandybridge上更糟,但在其他地方都差不多.(除了它位于 num
的关键路径上,这对于没有零延迟 mov
的CPU来说很重要.)
The total front-end uop count is better on pre-Broadwell, because we avoid cmov
. We traded an xor
-zeroing for a mov
, which is worse on Sandybridge, but about equal everywhere else. (Except that it's on the critical path for num
, which matters for CPUs without zero-latency mov
.)
顺便说一句,如果在 direction
上的分支非常可预测,则分支实现实际上可能会更快.但是通常这意味着最好内联 rol
或 ror
指令会更好.
BTW, a branching implementation could actually be faster if the branch on direction
is very predictable. But usually that means it would have been better to just inline a rol
or ror
instruction.
或者这一个:gcc的输出,其中删除了多余的和ecx,63
.在某些CPU上应该是不错的选择,但是与上面的相比没有太大的优势.(而且显然在包括Skylake在内的主流Intel Sandybridge系列CPU上更糟.)
Or this one: gcc's output with the redundant and ecx, 63
removed. It should be pretty good on some CPUs, but doesn't have much advantage compared to the above. (And is clearly worse on mainstream Intel Sandybridge-family CPUs including Skylake.)
;; not good on Intel SnB-family
;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
mov ecx, esi
mov rax, rdi
rol rax, cl ; 3 uops
ror rdi, cl ; false-dependency on flags on Intel SnB-family
test edx, edx ; look at the low 32 bits for 0 / non-0
cmovz rax, rdi ; direction=0 means use the rotate-right result
ret
错误的依赖关系仅适用于标志设置的对象;我认为 ror rdi,cl
的 rdi
结果与前面的 rol rax,cl
的标志合并uop无关.(请参阅 SHL/SHR r,cl延迟低于吞吐量).但是所有微指令都需要p0或p6,因此会出现资源冲突,从而限制了指令级并行性.
The false dependency is only for the flag-setting uops; I think the rdi
result of ror rdi,cl
is independent of the flag-merge uop of the preceding rol rax,cl
. (See SHL/SHR r,cl latency is lower than throughput). But all the uops require p0 or p6, so there will be resource conflicts that limit instruction-level parallelism.
呼叫者在 edi
中为您传递了一个已签名的轮换计数.或根据需要将其命名为 rdi
;您会忽略它的低6位,而实际上只是在 [0,63
]范围内进行左旋转,但这与支持范围为的左右旋转相同[-63,+63]
.(较大的值会包裹在该范围内).
Caller passes you a signed rotate count in edi
. Or call it rdi
if you want; you ignore all but the low 6 bits of it, and you actually just do a left-rotate in the range [0, 63
, but that's the same as supporting left and right rotates with range [-63, +63]
. (With larger values wrapping into that range).
例如 -32
的arg是 0xffffffe0
,它掩盖为 0x20
,即32.在任一方向上旋转32是相同的操作.
e.g. an arg of -32
is 0xffffffe0
, which masks down to 0x20
, which is 32. Rotating by 32 in either direction is the same operation.
rotate:
mov rax, rdi
mov ecx, esi
rol rax, cl
ret
唯一有效的方法是向调用方内联以避免 mov
和 call
/ ret
指令.(或者对于恒定计数旋转,使用立即旋转计数,这使其成为Intel CPU上的单uup指令.)
The only way this could be any more efficient is inlining into the caller to avoid the mov
and call
/ret
instructions. (Or for constant-count rotates, using an immediate rotate count which makes it a single-uop instruction on Intel CPUs.)
这篇关于如何在装配中旋转值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!