What is the purpose of the loop "xorl %edx,%eax; shrl $1,%edx"?

Question

I have the following x86 assembly code:

  movl   8(%ebp), %edx  //get an argument from the caller
  movl   $0, %eax
  testl  %edx, %edx
  je     .L1
.L2:                   // what's the purpose of this loop body?
  xorl   %edx, %eax
  shrl   $1, %edx
  jne    .L2
.L1:
  andl   $1, %eax

The textbook gives the corresponding C code as follows:

int f1(unsigned x)
{
    int y = 0;
    while(x != 0) {
        __________;
    }
    return __________;
}

The book asks readers to fill in the blanks and to answer the question "What does it do?"

I can't combine the loop body into a single C expression. I can tell what the loop body does, but I have no idea what its purpose is. The textbook also says that %eax holds the return value here. So what is the purpose of the following instruction?

andl  $1, %eax

I have no idea about that either.

Answer

It looks like the purpose of the whole loop is to XOR together all the bits of the 32-bit argument, i.e. to calculate its parity.

Working backwards from the last instruction (and $1,%eax), we know that only the low bit of the result matters.

With that in mind, the xor %edx,%eax becomes clearer: it XORs the current low bit of %edx into the low bit of %eax. The garbage in the upper bits doesn't matter, because the final and masks it off.

The shr / jne pair loops until all of x's bits have been shifted out: shr sets ZF when its result is zero, so jne falls through once %edx is 0. We could always loop 32 times to get all the bits, but that would be less efficient than stopping once x is 0. (Because of how XOR works, we don't need to actually XOR in the 0 bits; that has no effect.)
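As a rough illustration of the reasoning above, here is a minimal C sketch that models the registers directly and compares the result with a version that only ever tracks bit 0. The function names (asm_model, low_bit_only, check_masking) are hypothetical, and the sketch assumes a 32-bit unsigned, as in the question.

```c
#include <assert.h>

/* Models the question's asm one instruction at a time. */
unsigned asm_model(unsigned arg) {
    unsigned edx = arg;   /* movl 8(%ebp), %edx */
    unsigned eax = 0;     /* movl $0, %eax */
    while (edx != 0) {    /* testl/je on entry, then shrl/jne inside the loop */
        eax ^= edx;       /* xorl %edx, %eax: the upper bits collect garbage */
        edx >>= 1;        /* shrl $1, %edx */
    }
    return eax & 1u;      /* andl $1, %eax: keep only bit 0 */
}

/* Same computation, but masking at every step so no garbage accumulates. */
unsigned low_bit_only(unsigned arg) {
    unsigned parity = 0;
    while (arg != 0) {
        parity ^= arg & 1u;  /* fold in the current low bit only */
        arg >>= 1;
    }
    return parity;
}

/* Self-check: masking once at the end gives the same answer as masking every step. */
void check_masking(unsigned x) {
    assert(asm_model(x) == low_bit_only(x));
}
```

Calling check_masking on a handful of values is a quick way to convince yourself that the garbage in the upper bits of %eax never reaches the result.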


Once we know what the function does, filling in the C becomes an exercise in clever / compact C syntax. I thought at first that y ^= (x>>=1); would fit inside the loop, but that shifts x before using it the first time.

The only way I see to do it in one C statement is with the , operator (which does introduce a sequence point, so it's safe to read x on the left side and modify it on the right side of a ,). So, y ^= x, x>>=1; fits.
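To make the ordering issue concrete, the following sketch puts the two candidate loop bodies side by side. The names f1_comma, f1_shift_first and check_loop_bodies are made up for this illustration; the test values are kept small so the int/unsigned mix from the textbook skeleton stays unproblematic.

```c
#include <assert.h>

/* Comma operator: y ^= x is sequenced before x >>= 1. */
int f1_comma(unsigned x) {
    int y = 0;
    while (x != 0) {
        y ^= x, x >>= 1;
    }
    return y & 1;
}

/* Shifts first, so bit 0 of the original x is never folded into y. */
int f1_shift_first(unsigned x) {
    int y = 0;
    while (x != 0) {
        y ^= (x >>= 1);
    }
    return y & 1;
}

/* Self-check: x = 1 has odd parity, but the shift-first version reports 0. */
void check_loop_bodies(void) {
    assert(f1_comma(1u) == 1);
    assert(f1_shift_first(1u) == 0);
}
```

With x = 1 the shift-first body clears x before it is ever XORed into y, which is exactly the problem described above.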

Or, for more readable code, just cheat and put two statements on the same line with a ;.

int f1(unsigned x) {
    int y = 0;
    while(x != 0) {
        y ^= x;  x>>=1;
    }
    return y & 1;
}

This compiles to essentially the same asm as shown in the question, using gcc5.3 -O3 on the Godbolt compiler explorer. Compared with that output, the question's code de-optimizes the xor-zeroing idiom to a mov $0, %eax, and optimizes away gcc's silly duplication of ret instructions. (Or maybe it came from an earlier version of gcc that didn't do that.)


The loop is very inefficient. Here is a more efficient way:

We don't need a loop with O(n) complexity (where n is the width in bits of x). Instead, we can get O(log2(n)) complexity, and actually take advantage of x86 tricks to only do the first 2 steps of that.
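In portable C, the O(log2(n)) idea might look like the sketch below: each step XORs the upper half of the remaining bits into the lower half, so five steps cover 32 bits. The names parity_fold, parity_bit_loop and check_fold are hypothetical, and uint32_t is used to pin the width to 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Folds 32 bits down to 1 in five XOR steps. */
int parity_fold(uint32_t x) {
    x ^= x >> 16;  /* 32 -> 16 bits of interest */
    x ^= x >> 8;   /* 16 -> 8 */
    x ^= x >> 4;   /* 8 -> 4 */
    x ^= x >> 2;   /* 4 -> 2 */
    x ^= x >> 1;   /* 2 -> 1 */
    return (int)(x & 1u);
}

/* One-bit-at-a-time reference, O(n) in the width of x. */
int parity_bit_loop(uint32_t x) {
    int p = 0;
    while (x != 0) {
        p ^= (int)(x & 1u);
        x >>= 1;
    }
    return p;
}

/* Self-check: both approaches should agree for any input. */
void check_fold(uint32_t x) {
    assert(parity_fold(x) == parity_bit_loop(x));
}
```

On x86 only the first two folding steps are needed, because the hardware computes the parity of the remaining 8 bits itself.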

I've left off the operand-size suffix for instructions where it's determined by the registers. (Except for xorw to make the 16-bit xor explicit.)

#untested
parity:
    # no frame-pointer boilerplate

    xor       %eax,%eax        # zero eax (so the upper 24 bits of the int return value are zeroed).  And yes, this is more efficient than mov $0, %eax
                               # so when we set %al later, the whole of %eax will be good.

    movzwl    4(%esp), %edx      # load low 16 bits of `x`.  (Zero-extending into the full %edx is for efficiency; movw 4(%esp), %dx would work too.)
    xorw      6(%esp), %dx       # xor the high 16 bits of `x`
    # Two loads instead of a load + copy + shift is probably a win, because cache is fast.
    xor       %dh, %dl           # xor the two 8 bit halves, setting PF according to the result
    setnp      %al               # get the inverse of the CPU's parity flag.  Remember that the rest of %eax is already zero, so the result is already zero-extended to 32-bits (int return value)
    ret

Yes, that's right, x86 has a parity flag (PF) that's updated from the low 8 bits of the result of every instruction that "sets flags according to the result", like xor.

We use the np condition because PF = 1 means even parity: xor of all bits = 0. We need the inverse to return 0 for even parity.

To take advantage of PF, which only covers the low 8 bits of a result, we do a SIMD-style horizontal reduction: bring the high half down to the low half and combine, repeating twice to reduce 32 bits to 8 bits.
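The asm above is marked untested, so as a sanity check here is a small C model of it. The helper parity_flag is an assumption of this sketch: it imitates how PF is derived from the low 8 bits of a result. The names parity_model and check_parity_model are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models PF: 1 when the low 8 bits of a result hold an even number of set bits. */
static int parity_flag(uint8_t result) {
    int ones = 0;
    for (int i = 0; i < 8; i++) {
        ones += (result >> i) & 1;
    }
    return (ones % 2) == 0;
}

/* Models the asm: two folding steps, then read the inverse of PF. */
int parity_model(uint32_t x) {
    uint16_t dx = (uint16_t)(x & 0xFFFFu);  /* movzwl 4(%esp), %edx */
    dx ^= (uint16_t)(x >> 16);              /* xorw 6(%esp), %dx */
    uint8_t dl = (uint8_t)(dx & 0xFFu);
    uint8_t dh = (uint8_t)(dx >> 8);
    dl ^= dh;                               /* xor %dh, %dl: PF reflects dl */
    return !parity_flag(dl);                /* setnp %al */
}

/* Self-check against a plain bit-by-bit XOR. */
void check_parity_model(uint32_t x) {
    int p = 0;
    for (uint32_t t = x; t != 0; t >>= 1) {
        p ^= (int)(t & 1u);
    }
    assert(parity_model(x) == p);
}
```

The model uses shifts rather than memory loads, so it sidesteps the little-endian layout that lets the asm read the high half at 6(%esp).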

Zeroing eax (with an xor) before the instruction that sets flags is slightly more efficient than doing set-flags / setp %al / movzbl %al, %eax, as I explained in What is the best way to set a register to zero in x86 assembly: xor, mov or and?.


Or, as @EOF points out, if the CPUID POPCNT feature bit is set, you can use popcnt and test the low bit to see if the number of set bits is even or odd. (Another way to look at this: xor is add-without-carry, so the low bit is the same whether you xor all the bits together or add all the bits together horizontally).
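The popcount observation can be checked with a portable sketch that does not rely on the popcnt instruction. The function names are hypothetical; popcount_portable uses the well-known trick of clearing the lowest set bit on each iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Counts set bits: x &= x - 1 clears the lowest set bit each time round. */
int popcount_portable(uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Parity as the low bit of the horizontal sum of all bits. */
int parity_via_popcount(uint32_t x) {
    return popcount_portable(x) & 1;
}

/* Parity as the XOR of all bits. */
int parity_via_xor(uint32_t x) {
    int p = 0;
    while (x != 0) {
        p ^= (int)(x & 1u);
        x >>= 1;
    }
    return p;
}

/* Self-check: adding and XORing agree in the low bit. */
void check_popcount_parity(uint32_t x) {
    assert(parity_via_popcount(x) == parity_via_xor(x));
}
```

Both functions return the same value because carries out of bit 0 of the sum never feed back into bit 0.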

GNU C also has __builtin_parity and __builtin_popcnt which use the hardware instruction if you tell the compiler that the compile target supports it (with -march=... or -mpopcnt), but otherwise compile to an efficient sequence for the target machine. The Intel intrinsics always compile to the machine instruction, not a fallback sequence, and it's a compile-time error to use them without the appropriate -mpopcnt target option.

Unfortunately gcc doesn't recognize the pure-C loop as being a parity calculation and optimize it into this. Some compilers (like clang and probably gcc) can recognize some kinds of popcount idioms, and optimize them into the popcnt instruction, but that kind of pattern recognition doesn't happen in this case. :(

See these on godbolt.

int parity_gnuc(unsigned x) {
    return  __builtin_parity(x);
}
    # with -mpopcnt, compiles the same as below
    # without popcnt, compiles to the same upper/lower half XOR algorithm I used, and a setnp
    # using one load and mov/shift for the 32->16 step, and still %dh, %dl for the 16->8 step.

#ifdef __POPCNT__
#include <immintrin.h>
int parity_popcnt(unsigned x) {
    return  _mm_popcnt_u32(x) & 1;
}
#endif

    # gcc does compile this to the optimal code:
    popcnt    4(%esp), %eax
    and       $1, %eax
    ret


See also other links in the x86 tag wiki.
