One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Solution

Here is a real-world example: fixed-point multiplies on old compilers.

These don't only come in handy on devices without floating point; they shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits, and it's harder to predict precision loss). That is, uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).

Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, consider:

- Getting the high part of a 64-bit integer multiplication: a portable version using uint64_t for 32x32 => 64-bit partial products fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems (a sketch follows below).
- _umul128 on 32-bit Windows: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
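As a minimal sketch of that first bullet (the helper name mulhi64 and the MSVC fallback are illustrative assumptions, not part of the original answer): in GNU C, unsigned __int128 lets the compiler emit a single widening multiply, where the portable 32-bit decomposition often does not optimize as well.

```c
#include <stdint.h>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   // __umulh
#endif

// Hypothetical helper: high 64 bits of a 64x64 => 128-bit unsigned multiply.
static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
#if defined(__SIZEOF_INT128__)
    // gcc/clang on 64-bit targets: typically one widening MUL instruction.
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#elif defined(_MSC_VER) && defined(_M_X64)
    return __umulh(a, b);   // MSVC x64 intrinsic
#else
    // Portable fallback: compose the product from four 32x32 => 64 partials.
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;
    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
#endif
}
```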
C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

```c
// On a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul(int a, int b)
{
    long long a_long = a;            // cast to 64 bit
    long long product = a_long * b;  // perform the multiplication
    return (int)(product >> 16);     // shift by the fixed-point bias
}
```

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (the x86 can also do such shifts).

So we're left with one or two library calls just for a multiply. This has serious consequences: not only is the shift slower, but registers must be preserved across the function calls, and it does not help inlining and code-unrolling either. If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this, using asm is not the best way to solve the problem. Most compilers let you use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32 => 64-bit mul as __emul and the 64-bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that the C compiler has a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over hand-written assembler code that way.

For reference, the end result for the fixed-point mul for the VS.NET compiler is:

```c
int inline FixedPointMul(int a, int b)
{
    return (int)__ll_rshift(__emul(a, b), 16);
}
```

The performance difference of fixed-point divides is even bigger: I had improvements up to a factor of 10 for division-heavy fixed-point code by writing a couple of asm lines.

Using Visual C++ 2013 gives the same assembly code for both ways, and gcc 4.1 from 2007 also optimizes the pure C version nicely. See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately Godbolt doesn't have any compilers old enough to produce bad code from the simple pure C version, but presumably even older GCC versions could also do this without intrinsics.)

Modern CPUs can do things C doesn't have operators for at all, like popcnt, or bit-scan to find the first or last set bit. (POSIX has an ffs() function, but its semantics don't match x86 bsf/bsr; see https://en.wikipedia.org/wiki/Find_first_set.)

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or, on x86 if you're only targeting hardware with SSE4.2, _mm_popcnt_u32 from <immintrin.h>. Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
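As a minimal sketch of the reliable options just mentioned (the wrapper name popcount32 and the portable fallback are illustrative additions, not part of the original answer):

```c
#include <stdint.h>

// Hypothetical wrapper: population count of a 32-bit value.
// With gcc/clang, __builtin_popcount compiles to a single POPCNT
// instruction when the target supports it (e.g. -mpopcnt or an
// SSE4.2-era -march), and to a fast bit-twiddling sequence otherwise.
static inline int popcount32(uint32_t x)
{
#if defined(__GNUC__)
    return __builtin_popcount(x);
#else
    // Portable fallback: classic SWAR bit count, correct everywhere.
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (int)((x * 0x01010101u) >> 24);
#endif
}
```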
Similarly, ntohl can compile to bswap (the x86 32-bit byte swap for endian conversion) on some C implementations that have it.

Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but they often do badly, or don't auto-vectorize at all, when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.
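For a sense of what manual vectorization with intrinsics looks like, here is a rough SSE2 sketch of that simple loop (the function name and the assumption that dst and src don't overlap are mine; compilers handle this easy case fine on their own, and it's the more complicated patterns where hand-vectorization pays off):

```c
#include <stddef.h>
#include <emmintrin.h>  // SSE2 intrinsics

// Manual SSE2 version of: dst[i] += src[i] * 10.0;
// Processes two doubles per iteration; a scalar loop handles the tail.
void scale_add(double *dst, const double *src, size_t n)
{
    const __m128d ten = _mm_set1_pd(10.0);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d s = _mm_loadu_pd(src + i);   // unaligned load of src[i..i+1]
        __m128d d = _mm_loadu_pd(dst + i);
        d = _mm_add_pd(d, _mm_mul_pd(s, ten));
        _mm_storeu_pd(dst + i, d);
    }
    for (; i < n; i++)
        dst[i] += src[i] * 10.0;             // leftover element, if n is odd
}
```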