Best practices for circular shift (rotate) operations in C++

Question

Left and right shift operators (<< and >>) are already available in C++. However, I couldn't find out how to perform circular shift or rotate operations.

How can operations like "Rotate Left" and "Rotate Right" be performed?

Rotating right twice here

Initial --> 1000 0011 0100 0010

should result in:

Final   --> 1010 0000 1101 0000

An example would be helpful.

(Editor's note: Many common ways of expressing rotates in C suffer from undefined behaviour if the rotate count is zero, or compile to more than just a single rotate machine instruction. This question's answer should document best practices.)

Recommended answer

See also an earlier version of this answer on another rotate question, which has some more details about what asm gcc/clang produce for x86.

The most compiler-friendly way to express a rotate in C and C++ that avoids any undefined behaviour seems to be John Regehr's implementation. I've adapted it to rotate by the width of the type (using fixed-width types like uint32_t).

#include <stdint.h>   // for uint32_t
#include <limits.h>   // for CHAR_BIT
// #define NDEBUG
#include <assert.h>

static inline uint32_t rotl32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);  // assumes width is a power of 2.

  // assert ( (c<=mask) &&"rotate by type width or more");
  c &= mask;
  return (n<<c) | (n>>( (-c)&mask ));
}

static inline uint32_t rotr32 (uint32_t n, unsigned int c)
{
  const unsigned int mask = (CHAR_BIT*sizeof(n) - 1);

  // assert ( (c<=mask) &&"rotate by type width or more");
  c &= mask;
  return (n>>c) | (n<<( (-c)&mask ));
}

Works for any unsigned integer type, not just uint32_t, so you could make versions for other sizes.
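As a sketch of such an "other size" version, here is a 16-bit right rotate applied to the example from the question (1000 0011 0100 0010 is 0x8342). The name rotr16 and the intermediate unsigned int variable are my own choices for illustration, not part of Regehr's code; widening first keeps both shifts in an unsigned type despite integer promotion.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // uint16_t

// Hypothetical 16-bit variant of the rotr32 idiom above.
static inline uint16_t rotr16(uint16_t n, unsigned int c)
{
    const unsigned int mask = CHAR_BIT * sizeof(n) - 1;  // 15 for a 16-bit type

    c &= mask;
    const unsigned int wide = n;  // shift as unsigned int, not as promoted signed int
    // Bits shifted above bit 15 are discarded by the cast back to uint16_t.
    return static_cast<uint16_t>((wide >> c) | (wide << ((-c) & mask)));
}
```

With this sketch, rotr16(0x8342, 2) gives 0xA0D0, which is the 1010 0000 1101 0000 pattern the question asks for.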

See also a C++11 template version with lots of safety checks, including a static_assert that the type width is a power of 2. That isn't the case on some 24-bit DSPs or 36-bit mainframes, for example.
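The linked template is not reproduced here, so the following is only a minimal sketch of what such a version could look like, not that code. The name rotl_sketch and the wide_t typedef are hypothetical; the two static_asserts correspond to the kind of safety checks mentioned above.

```cpp
#include <climits>      // CHAR_BIT
#include <cstdint>
#include <type_traits>  // std::is_unsigned

// Hypothetical name; a sketch under the assumptions stated in the lead-in.
template <typename T>
inline T rotl_sketch(T n, unsigned int c)
{
    static_assert(std::is_unsigned<T>::value, "rotate requires an unsigned integer type");
    constexpr unsigned int bits = CHAR_BIT * sizeof(T);
    static_assert((bits & (bits - 1)) == 0, "type width must be a power of 2");
    constexpr unsigned int mask = bits - 1;

    // Do the shifts in an unsigned type at least as wide as unsigned int,
    // so a narrow T is never promoted to signed int.
    typedef decltype(n | 0u) wide_t;
    const wide_t w = n;

    c &= mask;
    // The cast truncates any bits that were shifted above the width of T.
    return static_cast<T>((w << c) | (w >> ((-c) & mask)));
}
```

Because T is deduced from the argument, callers would normally name it explicitly, for example rotl_sketch<uint8_t>(x, 1).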

I'd recommend only using the template as a back-end for wrappers with names that include the rotate width explicitly. Integer-promotion rules mean that rotl_template(u16 & 0x11UL, 7) would do a 32-bit or 64-bit rotate, not a 16-bit one (depending on the width of unsigned long). Even uint16_t & uint16_t is promoted to signed int by C++'s integer-promotion rules, except on platforms where int is no wider than uint16_t.
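To make the promotion problem concrete, the sketch below reports the width that a template would deduce for a few expressions. Both deduced_bits and bits_seen_by_rot16 are hypothetical helpers written for this illustration; they only measure the argument type and do not rotate anything.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // uint16_t

// Hypothetical helper: the bit width a rotate template would deduce for its argument.
template <typename T>
constexpr unsigned int deduced_bits(T)
{
    return static_cast<unsigned int>(CHAR_BIT * sizeof(T));
}

// Hypothetical width-named wrapper: the uint16_t parameter converts the
// argument back to 16 bits before the template ever sees it.
inline unsigned int bits_seen_by_rot16(uint16_t x)
{
    return deduced_bits(x);
}
```

For uint16_t a and b, deduced_bits(a & b) reports the width of int and deduced_bits(a & 0x11UL) the width of unsigned long, while bits_seen_by_rot16(a & b) reports 16. That is the reason for routing calls through a wrapper whose name and parameter type fix the width.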

On x86, this version inlines to a single rol r32, cl (or rol r32, imm8) with compilers that grok it, because the compiler knows that x86 rotate and shift instructions mask the shift count the same way the C source does.

Compiler support for this UB-avoiding idiom on x86, for uint32_t x and unsigned int n, with variable-count shifts:

  • clang: recognized for variable-count rotates since clang3.5; multiple shift + OR instructions before that.
  • gcc: recognized for variable-count rotates since gcc4.9; multiple shift + OR instructions before that. gcc5 and later also optimize away the branch and mask in the Wikipedia version, using just a ror or rol instruction for variable counts.
  • icc: supported for variable-count rotates since ICC13 or earlier. Constant-count rotates use shld edi,edi,7, which is slower and takes more bytes than rol edi,7 on some CPUs (especially AMD, but also some Intel), when BMI2 isn't available for rorx eax,edi,25 to save a MOV.
  • MSVC: x86-64 CL19: only recognized for constant-count rotates. (The Wikipedia idiom is recognized, but the branch and AND aren't optimized away.) Use the _rotl / _rotr intrinsics from <intrin.h> on x86 (including x86-64).

gcc for ARM uses an and r1, r1, #31 for variable-count rotates, but still does the actual rotate with a single instruction: ror r0, r0, r1. So gcc doesn't realize that rotate counts are inherently modular. As the ARM docs say, "ROR with shift length, n, more than 32 is the same as ROR with shift length n-32". I think gcc gets confused here because left/right shifts on ARM saturate the count, so a shift by 32 or more will clear the register. (Unlike x86, where shifts mask the count the same way rotates do.) It probably decides it needs an AND instruction before recognizing the rotate idiom, because of how non-circular shifts work on that target.

Current x86 compilers still use an extra instruction to mask a variable count for 8-bit and 16-bit rotates, probably for the same reason they don't avoid the AND on ARM. This is a missed optimization, because performance doesn't depend on the rotate count on any x86-64 CPU. (Masking of counts was introduced with the 286 for performance reasons, because it handled shifts iteratively rather than with constant latency like modern CPUs.)

BTW, prefer rotate-right for variable-count rotates, to avoid making the compiler do 32-n to implement a left rotate on architectures like ARM and MIPS that only provide a rotate-right. (This optimizes away with compile-time-constant counts.)
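The 32-n adjustment can be written out in source form. In this sketch, rotr32 repeats the idiom from earlier in the answer, and rotl32_via_rotr is a hypothetical name showing the negation a rotate-right-only target has to perform for a variable-count left rotate.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // uint32_t

static inline uint32_t rotr32(uint32_t n, unsigned int c)
{
    const unsigned int mask = CHAR_BIT * sizeof(n) - 1;

    c &= mask;
    return (n >> c) | (n << ((-c) & mask));
}

// A left rotate by c is a right rotate by (32 - c) mod 32.
// With an unsigned count, that is simply -c after rotr32 masks it.
static inline uint32_t rotl32_via_rotr(uint32_t n, unsigned int c)
{
    return rotr32(n, -c);
}
```

When the count is a compile-time constant the negation folds away, which matches the remark above that only variable counts pay for it.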

Fun fact: ARM doesn't really have dedicated shift/rotate instructions; it's just MOV with the source operand going through the barrel shifter in ROR mode: mov r0, r0, ror r1. So a rotate can fold into a register-source operand for an EOR instruction or something.

Make sure you use unsigned types for n and the return value, or else it won't be a rotate. (gcc for x86 targets does arithmetic right shifts, shifting in copies of the sign bit rather than zeroes, which causes a problem when you OR the two shifted values together. Right shifts of negative signed integers are implementation-defined behaviour in C.)
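A small sketch of the difference, assuming a 32-bit value. All three function names are hypothetical. The signed version is shown only for contrast: its result for negative inputs is implementation-defined before C++20, so no specific value is claimed for it here.

```cpp
#include <cstdint>

// Unsigned operand: a logical shift, zeroes come in at the top. Requires c < 32.
inline uint32_t shr_logical(uint32_t x, unsigned int c) { return x >> c; }

// Signed operand: for negative x the result is implementation-defined before C++20.
// Compilers that shift arithmetically copy the sign bit into the vacated bits,
// and those copies would corrupt the OR in a rotate. Requires c < 32.
inline int32_t shr_signed(int32_t x, unsigned int c) { return x >> c; }

// If the bits live in a signed variable, convert to unsigned before rotating.
inline uint32_t rotr32_bits(int32_t x, unsigned int c)
{
    const uint32_t n = static_cast<uint32_t>(x);  // modular conversion, well-defined
    c &= 31u;
    return (n >> c) | (n << ((-c) & 31u));
}
```

For example, rotr32_bits(-2, 1) rotates the pattern 0xFFFFFFFE and yields 0x7FFFFFFF, with the low zero bit arriving at the top.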

Also, make sure the shift count is an unsigned type, because (-n)&31 with a signed type could be one's complement or sign/magnitude, and not the same as the modular 2^n arithmetic you get with unsigned or two's complement. (See the comments on Regehr's blog post.) unsigned int does well on every compiler I've looked at, for every width of x. Some other types actually defeat the idiom recognition for some compilers, so don't just use the same type as x.
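The arithmetic behind (-n)&31 can be checked at compile time. This is a sketch for the 32-bit case only, and paired_count is a hypothetical name for the count used by the second shift of the idiom.

```cpp
// For an unsigned count c, -c wraps modulo 2^N, so (-c) & 31 equals (32 - c) mod 32.
constexpr unsigned int paired_count(unsigned int c) { return (-c) & 31u; }

static_assert(paired_count(5) == 27, "5 pairs with 27");
static_assert(paired_count(31) == 1, "31 pairs with 1");
// A count of 0 pairs with 0, not 32, so neither shift is by the full type width.
static_assert(paired_count(0) == 0, "0 pairs with 0");
static_assert(paired_count(32) == 0, "counts wrap modulo 32");
```

The zero case is the important one: a plain 32 - c would produce a shift by 32 there, which is the undefined behaviour the idiom is designed to avoid.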

Some compilers provide intrinsics for rotates, which is far better than inline asm if the portable version doesn't generate good code on the compiler you're targeting. There aren't cross-platform intrinsics for any compilers that I know of. These are some of the x86 options:

  • Intel documents that <immintrin.h> provides _rotl and _rotl64 intrinsics, and the same for right rotates. MSVC requires <intrin.h>, while gcc requires <x86intrin.h>. An #ifdef takes care of gcc vs. icc, but clang doesn't seem to provide them anywhere, except in MSVC compatibility mode with -fms-extensions -fms-compatibility -fms-compatibility-version=17.00. And the asm it emits for them sucks (extra masking and a CMOV).
  • MSVC: _rotr8 and _rotr16.
  • gcc and icc (not clang): <x86intrin.h> also provides __rolb/__rorb for 8-bit rotate left/right, __rolw/__rorw (16-bit), __rold/__rord (32-bit), __rolq/__rorq (64-bit, only defined for 64-bit targets). For narrow rotates, the implementation uses __builtin_ia32_rolhi or ...qi, but the 32-bit and 64-bit rotates are defined using shift/or (with no protection against UB, because the code in ia32intrin.h only has to work on gcc for x86). GNU C appears not to have any cross-platform __builtin_rotate functions the way it does for __builtin_popcount (which expands to whatever's optimal on the target platform, even if it's not a single instruction). Most of the time you get good code from idiom recognition.
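// For real use, probably use a rotate intrinsic for MSVC, or the idiom above for other compilers.  This pattern of #ifdefs may be helpful.
// MSVC defines _M_X64 / _M_IX86 rather than the GNU-style __x86_64__ / __i386__.
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)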

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>  // Not just <immintrin.h> for compilers other than icc
#endif

uint32_t rotl32_x86_intrinsic(uint32_t x, unsigned n) {
  //return __builtin_ia32_rorhi(x, 7);  // 16-bit rotate, GNU C
  return _rotl(x, n);  // gcc, icc, msvc.  Intel-defined.
  //return __rold(x, n);  // gcc, icc.
  // can't find anything for clang
}
#endif
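Since this answer was written, C++20 has added std::rotl and std::rotr in <bit>, which work on any unsigned integer type and take an int count. The sketch below uses them when the standard library advertises them and otherwise falls back to the idiom from earlier in the answer. The function name rotl32_portable is hypothetical, and the feature-test guard is just one possible way to select between the two.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdint>   // uint32_t

#if defined(__has_include)
#  if __has_include(<bit>)
#    include <bit>   // defines __cpp_lib_bitops when std::rotl is available
#  endif
#endif

// Hypothetical wrapper: standard rotate where available, UB-avoiding idiom otherwise.
inline uint32_t rotl32_portable(uint32_t x, unsigned int c)
{
#if defined(__cpp_lib_bitops) && __cpp_lib_bitops >= 201907L
    return std::rotl(x, static_cast<int>(c & 31u));
#else
    const unsigned int mask = CHAR_BIT * sizeof(x) - 1;

    c &= mask;
    return (x << c) | (x >> ((-c) & mask));
#endif
}
```

Both branches return the same values, so callers don't need to know which one was compiled in.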

Presumably some non-x86 compilers have intrinsics, too, but let's not expand this community-wiki answer to include them all. (Maybe do that in the existing answer about intrinsics.)

(The old version of this answer suggested MSVC-specific inline asm (which only works for 32-bit x86 code), or http://www.devx.com/tips/Tip/14043 for a C version. The comments are replying to that.)

Inline asm defeats many optimizations, especially MSVC-style inline asm, because it forces inputs to be stored and reloaded. A carefully written GNU C inline-asm rotate would allow the count to be an immediate operand for compile-time-constant shift counts, but it still couldn't optimize away entirely if the value to be shifted is also a compile-time constant after inlining. See https://gcc.gnu.org/wiki/DontUseInlineAsm.
