本文介绍了如何在装配中旋转值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要在汇编x86 64位中实现一个功能,而我无法更改该功能:

I am implementing a function in assembly x86 64 bits, which I am unable to alter:

   unsigned long rotate(unsigned long val, unsigned long num, unsigned long direction);

方向-左侧为1,右侧为0.

direction- 1 is left and 0 is right.

这是我的向左移动的代码,但是最后一关没有用.有谁可以帮助我吗.

This is my code to shift left but its not working the last bit is off. Can someone help me please.

  rotate:
  push rbp
  push rdi
  push rsi
  push rdx
  mov rbp, rsp
  sub rsp, 16
  cmp rdx, 1
  je shift_left


shift_left:
   mov rax, rdi
   shl rax, cl
   mov rax, rax
   mov rcx, rdi
   sub cl, 64
   shl rcx, cl
   or rax rdx
   mov rax, rax
   add rsp, 16
   #I pop all the registers used and ret

推荐答案

x86具有旋转指令.使用 rol rax,cl 进行旋转向左旋转,然后 ror rax,cl 向右旋转.

x86 has rotate instructions. Use rol rax, cl to rotate left, and ror rax, cl to rotate right.

似乎您没有意识到 cl rcx / ecx 的低字节.因此, shl rcx,cl 正在移动移位计数.您的功能过于复杂,但是在学习时这是正常的.需要实践才能找到可以用很少的说明即可实现的简单的潜在问题.

It seems you didn't realize that cl is the low byte of rcx / ecx. Thus shl rcx, cl is shifting the shift-count. Your function is over-complicated, but that's normal when you're just learning. It takes practice to find the simple underlying problem that you can implement in few instructions.

此外,我认为 mov rcx,rdi 应该是 mov rcx,rsi .IDK mov rax,rax 应该是什么;它只是一个无人操作.

Also, I think mov rcx, rdi was supposed to be mov rcx, rsi. IDK what mov rax,rax was supposed to be; its just a no-op.

调用左旋转和右旋转的不同函数会更加有效,除非您实际上需要 direction 作为运行时变量,而不仅仅是运行时常量1或0.

It would be significantly more efficient to call different functions for rotate-left vs. rotate-right, unless you actually need direction to be a runtime variable that isn't just a build-time constant 1 or 0.

或者使其无分支,有条件地执行 cl = 64-cl ,因为 n 的左旋转与的右旋转相同> 64-n .并且由于旋转指令掩盖了计数(并且旋转仍然是模块化的),因此您实际上可以只执行 -n 而不是 64-n .(请参阅 C ++中循环移位(旋转)操作的最佳做​​法,用于某些使用 -n 而不是 32-n 的C,并编译为单个旋转指令.

Or to make it branchless, conditionally do cl = 64-cl, because a left-rotate by n is the same thing as a right-rotate by 64-n. And because rotate instructions mask the count (and rotate is modular anyway), you can actually just do -n instead of 64-n. (See Best practices for circular shift (rotate) operations in C++ for some C that uses -n instead of 32-n, and compiles to a single rotate instruction).

TL:DR由于旋转对称,因此可以通过反计数来向另一个方向旋转.正如 @njuffa指出的您可以使用带符号的班次计数来编写该函数,其中负数表示以另一种方式旋转,因此调用方将在其中传递 num -num 第一名.

TL:DR Because of rotate symmetry, you can rotate in the other direction just by negating the count. As @njuffa points out, you could have written the function with a signed shift count where negative means rotate the other way, so the caller would pass you num or -num in the first place.

请注意,在您的代码中, sub cl,64 对下一个 shl 的移位计数没有影响,因为64位 shl 已经用 cl&掩盖了计数63 .

Note that in your code, sub cl, 64 has no effect on the shift count of the next shl, because 64-bit shl already masks the count with cl & 63.

我制作了一个C版本,以查看编译器会做什么(在 Godbolt编译器浏览器).gcc有一个有趣的想法:双向旋转并使用 cmov 选择正确的结果.之所以这么糟,是因为在Intel SnB系列CPU上,可变计数的移位/旋转为3 oups.(因为如果计数结果为 0 ,则它们必须保持标记不变.请参见,它也适用于旋转.)

I made a C version to see what compilers would do (on the Godbolt compiler explorer). gcc has an interesting idea: rotate both ways and use a cmov to pick the right result. This kinda sucks because variable-count shifts/rotates are 3 uops on Intel SnB-family CPUs. (Because they have to leave the flags unmodified if the count turns out to be 0. See the shift section of this answer, all of it applies to rotates as well.)

不幸的是,BMI2仅添加了立即计数版本的 rorx 和可变计数 shlx / shrx ,而不是可变计数no-标志旋转.

Unfortunately BMI2 only added an immediate-count version of rorx, and variable-count shlx/shrx, not variable-count no-flags rotate.

无论如何,基于这些想法,这是实现x86-64 System V ABI/调用约定(其中允许函数破坏input-arg寄存器和 r10 / r11 ).我假设您使用的是x86-64 SysV ABI(例如Linux或OS X)平台,因为您似乎正在使用 rdi rsi rdx 表示前3个参数(或至少尝试这样做),而您的 long 为64位.

Anyway, based on those ideas, here's a good way to implement your function for the x86-64 System V ABI / calling convention (where functions are allowed to clobber the input-arg registers and r10 / r11). I assume you're on a platform that uses the x86-64 SysV ABI (like Linux or OS X) because you appear to be using rdi, rsi, and rdx for the first 3 args (or at least trying to), and your long is 64 bits.

    ;; untested
    ;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
    xor     ecx, ecx
    sub     ecx, esi        ; -num

    test    edx, edx
    mov     rax, rdi        ; put val in the retval register

    cmovnz  ecx, esi        ; cl =  direction ? num : -num
    rol     rax, cl         ; works as a rotate-right by 64-num if direction is 0
    ret

xor-zero/sub通常比mov/neg好,因为xor-zeroing不在关键路径上.不过,在Ryzen上, mov / neg 更好,它具有零延迟整数 mov ,并且仍需要ALU uop来进行异或归零.但是,如果ALU uops不是您的瓶颈,那还是可以的.这在Intel Sandybridge上是一个明显的胜利(其中 xor -zeroing与NOP一样便宜),并且在没有零延迟 mov (例如Silvermont/KNL或AMD Bulldozer系列).

xor-zero / sub is often better than mov / neg because the xor-zeroing is off the critical path. mov / neg is better on Ryzen, though, which has zero-latency integer mov and still needs an ALU uop to do xor-zeroing. But if ALU uops aren't your bottleneck, this is still fine. It's a clear win on Intel Sandybridge (where xor-zeroing is as cheap as a NOP), and also a latency win on other CPUs that don't have zero-latency mov (like Silvermont/KNL, or AMD Bulldozer-family).

cmov 在Intel百老汇之前是2 oups.如果不是更好的话,用2的补码bithack替代xor/sub/test/cmov可能一样好. -num =〜num + 1 .

cmov is 2 uops on Intel pre-Broadwell. A 2's complement bithack alternative to xor/sub/test/cmov might be just as good if not better. -num = ~num + 1.

rotate:
    dec     edx             ; convert direction = 0 / 1 into  -1 / 0
    mov     ecx, esi        ; couldn't figure out how to avoid this with  lea  ecx, [rdx-1] or something

    xor     ecx, edx        ; (direction==0) ? ~num : num  ; NOT = xor with all-ones
    sub     ecx, edx        ; (direction==0) ? ~num + 1 : num + 0;
                            ; conditional negation using -num = ~num + 1.    (subtracting -1 is the same as adding 1)

    mov     rax, rdi        ; put val in the retval register
    rol     rax, cl         ; works as a rotate-right by 64-num if direction is 0
    ret

如果内联,则将具有更多优势,因此 num 可能已经在 ecx 中,这使其比其他选项(在代码大小和uop计数方面)更短

This would have more of an advantage if inlined so num could already be in ecx, making this shorter than the other options (in code-size and uop count).

Haswell上的延迟

Latency on Haswell

  • 从准备好方向 cl 进行 rol 的过程:3个周期( dec /xor / sub ).与其他版本中的 test / cmov 相同.(但是在Broadwell/Skylake test / cmov 上,从 direction cl 仅具有2个周期的延迟)
  • 从准备好 num 到准备好 cl :2个周期: mov (0)+ xor (1)+ sub (1),因此 num 仍有空间在1个周期后准备就绪.这比Haswell上的 cmov 更好,后者的 sub (1)+ cmov (2)= 3个周期.但是在Broadwell/Skylake上,任何一种方式都只有2c.
  • From direction being ready to cl being ready for rol: 3 cycles (dec / xor / sub). Same as test / cmov in the other version. (But on Broadwell/Skylake test/cmov only has 2 cycle latency from direction to cl)
  • From num being ready to cl being ready: 2 cycles: mov(0) + xor(1) + sub(1), so there's room for num to be ready 1 cycle later. This is better than with cmov on Haswell where it's sub(1) + cmov(2) = 3 cycles. But on Broadwell/Skylake, it's only 2c either way.

在Broadway之前,总的前端uop数量更好,因为我们避免使用 cmov .我们用 xor -zeroing换了 mov ,这在Sandybridge上更糟,但在其他地方都差不多.(除了它位于 num 的关键路径上,这对于没有零延迟 mov 的CPU来说很重要.)

The total front-end uop count is better on pre-Broadwell, because we avoid cmov. We traded an xor-zeroing for a mov, which is worse on Sandybridge, but about equal everywhere else. (Except that it's on the critical path for num, which matters for CPUs without zero-latency mov.)

顺便说一句,如果在 direction 上的分支非常可预测,则分支实现实际上可能会更快.但是通常这意味着最好内联 rol ror 指令会更好.

BTW, a branching implementation could actually be faster if the branch on direction is very predictable. But usually that means it would have been better to just inline a rol or ror instruction.

或者这一个:gcc的输出,其中删除了多余的和ecx,63 .在某些CPU上应该是不错的选择,但是与上面的相比没有太大的优势.(而且显然在包括Skylake在内的主流Intel Sandybridge系列CPU上更糟.)

Or this one: gcc's output with the redundant and ecx, 63 removed. It should be pretty good on some CPUs, but doesn't have much advantage compared to the above. (And is clearly worse on mainstream Intel Sandybridge-family CPUs including Skylake.)

;; not good on Intel SnB-family
;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
    mov     ecx, esi
    mov     rax, rdi

    rol     rax, cl         ; 3 uops
    ror     rdi, cl         ; false-dependency on flags on Intel SnB-family

    test    edx, edx        ; look at the low 32 bits for 0 / non-0
    cmovz   rax, rdi        ; direction=0 means use the rotate-right result
    ret

错误的依赖关系仅适用于标志设置的对象;我认为 ror rdi,cl rdi 结果与前面的 rol rax,cl 的标志合并uop无关.(请参阅 SHL/SHR r,cl延迟低于吞吐量).但是所有微指令都需要p0或p6,因此会出现资源冲突,从而限制了指令级并行性.

The false dependency is only for the flag-setting uops; I think the rdi result of ror rdi,cl is independent of the flag-merge uop of the preceding rol rax,cl. (See SHL/SHR r,cl latency is lower than throughput). But all the uops require p0 or p6, so there will be resource conflicts that limit instruction-level parallelism.

呼叫者在 edi 中为您传递了一个已签名的轮换计数.或根据需要将其命名为 rdi ;您会忽略它的低6位,而实际上只是在 [0,63 ]范围内进行左旋转,但这与支持范围为的左右旋转相同[-63,+63] .(较大的值会包裹在该范围内).

Caller passes you a signed rotate count in edi. Or call it rdi if you want; you ignore all but the low 6 bits of it, and you actually just do a left-rotate in the range [0, 63, but that's the same as supporting left and right rotates with range [-63, +63]. (With larger values wrapping into that range).

例如 -32 的arg是 0xffffffe0 ,它掩盖为 0x20 ,即32.在任一方向上旋转32是相同的操作.

e.g. an arg of -32 is 0xffffffe0, which masks down to 0x20, which is 32. Rotating by 32 in either direction is the same operation.

rotate:
    mov  rax, rdi
    mov  ecx, esi
    rol  rax, cl
    ret

唯一有效的方法是向调用方内联以避免 mov call / ret 指令.(或者对于恒定计数旋转,使用立即旋转计数,这使其成为Intel CPU上的单uup指令.)

The only way this could be any more efficient is inlining into the caller to avoid the mov and call/ret instructions. (Or for constant-count rotates, using an immediate rotate count which makes it a single-uop instruction on Intel CPUs.)

这篇关于如何在装配中旋转值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-03 07:03