I wrote this simple assembly code, ran it, and looked at the memory location using GDB:

    .text
    .global _main
    _main:
        pushq %rbp
        movl  $5, -4(%rbp)
        addl  $6, -4(%rbp)
        popq  %rbp
        ret

It's adding 5 to 6 directly in memory, and according to GDB it worked. So this is performing math operations directly in memory instead of in CPU registers.

Now writing the same thing in C and compiling it to assembly turns out like this:

    ...                      # clang output
    xorl  %eax, %eax
    movl  $0, -4(%rbp)
    movl  $5, -8(%rbp)
    movl  -8(%rbp), %ecx     # load a
    addl  $6, %ecx           # a += 6
    movl  %ecx, -8(%rbp)     # store a
    ...

It's moving the value into a register before adding. So why don't we add directly in memory? Is it slower? If so, why is adding directly in memory even allowed, and why didn't the assembler complain about my assembly code in the beginning?

Edit: here is the C code for the second assembly block; I disabled optimization when compiling.

    #include <iostream>
    int main() {
        int a = 5;
        a += 6;
        return 0;
    }

Solution

You disabled optimization, and you're surprised the asm looks inefficient? Well, don't be. You've asked the compiler to compile quickly: short compile times instead of short run times for the generated binary, and with debug-mode consistency.

Yes, GCC and clang will use memory-destination add when tuning for modern x86 CPUs. It is efficient if you have no use for the add result being in a register. Obviously your hand-written asm has a major missed optimization, though: movl $5+6, -4(%rbp) would be much more efficient, because both values are assemble-time constants, so leaving the add until runtime is horrible. Just like with your anti-optimized compiler output.

(Update: I just noticed your compiler output included xor %eax,%eax, so this looks like clang/LLVM, not gcc as I initially guessed. Almost everything in this answer applies equally to clang, but gcc -O0 doesn't look for the xor-zeroing peephole optimization at -O0, using mov $0, %eax instead.)

Fun fact: gcc -O0 will actually use addl $6, -4(%rbp) in your main.

You already know from your hand-written asm that adding an immediate to memory is encodeable as an x86 add instruction, so the only question is whether gcc's/LLVM's optimizer decides to use it or not. But you disabled optimization.

A memory-destination add doesn't perform the calculation "in memory"; the CPU internally has to load, add, and store. It doesn't disturb any of the architectural registers while doing so, but it doesn't just send the 6 to DRAM to be added there. See also Can num++ be atomic for 'int num'? for the C and x86 asm details of memory-destination ADD, with or without a lock prefix to make it appear atomic.
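As a minimal sketch of my own (not taken from the linked Q&A): the only source-level difference between the plain and atomic forms is the lock prefix, but either way the CPU does a read-modify-write internally rather than computing "in memory":

    addl      $6, -4(%rbp)    # decodes to load + add + store uops; not atomic with respect to other cores
    lock addl $6, -4(%rbp)    # same read-modify-write, made atomic (and a full memory barrier), so much slower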
There is computer-architecture research into putting ALUs into DRAM, so computation can happen in parallel instead of requiring all data to pass through the memory bus to the CPU for any computation to happen. This is becoming an ever-larger bottleneck as memory sizes grow faster than memory bandwidth, and CPU throughput (with wide SIMD instructions) also grows faster than memory bandwidth. (It takes more computational intensity, i.e. ALU work per load/store, for the CPU not to stall. Fast caches help, but some problems have large working sets and are hard to apply cache-blocking to. Fast caches do mitigate the problem most of the time.)

But as it stands now, add $6, -4(%rbp) decodes into load, add, and store uops inside your CPU. The load uses an internal temporary destination, not an architectural register.

Modern x86 CPUs have some hidden internal logical registers that multi-uop instructions can use for temporaries. These hidden registers are renamed onto the physical registers in the issue/rename stage as they're allocated into the out-of-order back-end, but in the front end (decoder output, uop cache, IDQ) uops can only reference the "virtual" registers that represent the machine's logical state. So the multiple uops that memory-destination ALU instructions decode to are probably using hidden tmp registers.

We know these exist for use by micro-code / multi-uop instructions: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ calls them "extra architectural registers for internal use". They're not architectural in the sense of being part of the x86 machine state, only in the sense of being logical registers that the register-allocation table (RAT) has to track for register renaming onto the physical register file. Their values aren't needed between x86 instructions, only for the uops within one x86 instruction, especially micro-coded ones like rep movsb (which checks the size and overlap, and uses 16- or 32-byte loads/stores if possible), but also for multi-uop memory+ALU instructions.

The original 8086 wasn't out-of-order, or even pipelined. It could just load right into the ALU input, then when the ALU was done, store the result. It didn't need temporary "architectural" registers in its register file, just normal buffering between components. This is presumably how everything up to 486 worked, maybe even Pentium.

In this case, add-immediate-to-memory is the optimal choice, if we pretend that the value was already in memory (instead of just having been stored there from another immediate constant).

Modern x86 evolved from 8086. There are lots of slow ways to do things in modern x86 asm, but none of them can be disallowed without breaking backwards compatibility. For example, the enter instruction was added back in 186 to support nested Pascal procedures, but is very slow now. The loop instruction has existed since 8086, but has been too slow for compilers to ever use since about the 486, I think, maybe the 386. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)
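As a rough sketch of why nobody misses enter (the 32-byte frame size is just an arbitrary value I picked): with a nesting level of 0 it does exactly what the ordinary prologue does, only as slow microcode:

    enter  $32, $0        # reserve a 32-byte stack frame; compact encoding, but microcoded and slow
    # the equivalent sequence compilers actually emit:
    pushq  %rbp
    movq   %rsp, %rbp
    subq   $32, %rsp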
x86 is absolutely the last architecture where you should ever think there's any connection between being allowed and being efficient. It has evolved very far from the hardware the ISA was designed for. But in general that connection doesn't hold on most ISAs either. For example, some implementations of PowerPC (notably the Cell processor in the PlayStation 3) have slow micro-coded variable-count shifts, but that instruction is part of the PowerPC ISA, so not supporting it at all would be very painful, and it's not worth using multiple instructions instead of letting the microcode do it, outside of hot loops.

You could maybe write an assembler that refused to use, or warned about, known-slow instructions like enter or loop, but sometimes you're optimizing for size, not speed, and then slow-but-small instructions like loop are useful. (https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code, and see x86 machine-code answers, like my GCD loop in 8 bytes of 32-bit x86 code, which uses lots of small but slow instructions like the 3-uop 1-byte xchg eax, r32, and even inc/loop as a 3-byte alternative to the 4-byte test ecx,ecx / jnz.) Optimizing for code size is useful in real life for boot sectors, or for fun things like 512-byte or 4k "demos", which draw cool graphics and play sound in only a tiny executable. Or for code that executes only once during startup, where smaller file size is better. Or for code that executes rarely over the lifetime of a program, where a smaller I-cache footprint is better than blowing away lots of cache (and suffering front-end stalls waiting for code fetch). That can outweigh being maximally efficient once the instruction bytes actually arrive at the CPU and are decoded, especially if the difference there is small compared to the code-size saving.

Normal assemblers will only complain about instructions that aren't encodeable; performance analysis is not their job. Their job is to turn text into bytes in an output file (optionally with object-file metadata), letting you create whatever byte sequence you want for whatever purpose you think might be useful.

Avoiding slowdowns requires looking at more than 1 instruction at once

Most of the ways you can make your code slow involve instructions that aren't obviously bad; it's the overall combination that's slow. Checking for performance mistakes in general requires looking at much more than 1 instruction at a time. For example, this code will cause a partial-register stall on Intel P6-family CPUs:

    mov ah, 1
    add eax, 123

Either of these instructions on its own could be part of efficient code, so an assembler (which only has to look at each instruction separately) isn't going to warn you. (Although writing AH at all is pretty questionable; normally a bad idea.) Maybe a better example would have been a partial-flag stall with dec/jnz in an adc loop, on CPUs before SnB-family made that cheap: Problems with ADC/SBB and INC/DEC in tight loops on some CPUs.

If you're looking for a tool to warn you about expensive instructions, GAS is not it. Static analysis tools like IACA or LLVM-MCA might be some help in showing you expensive instructions in a block of code. (What is IACA and how do I use it? and (How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?) They're aimed at analyzing loops, but feeding them a block of code, whether it's a loop body or not, will get them to show you how many uops each instruction costs in the front-end, and maybe something about latency.

But really you have to understand a bit more about the pipeline you're optimizing for, because the cost of each instruction depends on the surrounding code (whether it's part of a long dependency chain, and what the overall bottleneck is).
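For example (a sketch of my own, with cycle counts that assume a typical modern core where integer multiply has 3-cycle latency and 1-per-clock throughput), the same instruction can be cheap or expensive depending only on its surroundings:

    # serial: each imul needs the previous result, so this chain runs at ~3 cycles per instruction (latency-bound)
    imull   %ecx, %eax
    imull   %ecx, %eax
    imull   %ecx, %eax

    # independent: the same multiplies with different destinations pipeline at ~1 per cycle (throughput-bound)
    imull   %ecx, %eax
    imull   %ecx, %ebx
    imull   %ecx, %edx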
Related:

Assembly - How to score a CPU instruction by latency and throughput
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

GCC/clang -O0's biggest effect is no optimization at all between statements, spilling everything to memory and reloading, so each C statement is fully implemented by a separate block of asm instructions. (This is for consistent debugging, including modifying C variables while stopped at any breakpoint.)

But even within the block of asm for one statement, clang -O0 apparently skips the optimization pass that decides whether using CISC memory-destination instructions would be a win (given the current tuning). So clang's simplest code-gen tends to use the CPU as a load-store machine, with separate load instructions to get things into registers.

GCC -O0 happens to compile your main like you might expect. (With optimization enabled, it of course compiles to just xor %eax,%eax / ret, because a is unused.)

    main:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    $5, -4(%rbp)
        addl    $6, -4(%rbp)
        movl    $0, %eax
        popq    %rbp
        ret

How to see clang/LLVM using memory-destination add

I put these functions on the Godbolt compiler explorer with clang8.2 -O3. Each function compiled to one asm instruction, with the default -mtune=generic for x86-64. (Modern x86 CPUs decode memory-destination add efficiently, into at most as many internal uops as separate load/add/store instructions, and sometimes fewer, with micro-fusion of the load+add part.)

    void add_reg_to_mem(int *p, int b) {
        *p += b;
    }
        # I used AT&T syntax because that's what you were using. Intel syntax is nicer IMO.
        addl    %esi, (%rdi)
        ret

    void add_imm_to_mem(int *p) {
        *p += 3;
    }
        # gcc and clang -O3 both emit the same asm here; there's only one good choice
        addl    $3, (%rdi)
        ret

The gcc -O0 output is just totally braindead, e.g. reloading p twice because it clobbers the pointer while calculating the +3. I could also have used global variables, instead of pointers, to give the compiler something it couldn't optimize away; -O0 for that would probably be a lot less terrible.

    # gcc8.2 -O0 output
    # ... after making a stack frame and spilling `p` from RDI to -8(%rbp)
        movq    -8(%rbp), %rax    # load p
        movl    (%rax), %eax      # load *p, clobbering p
        leal    3(%rax), %edx     # edx = *p + 3
        movq    -8(%rbp), %rax    # reload p
        movl    %edx, (%rax)      # store *p + 3

GCC is literally not even trying to not suck, just to compile quickly and respect the constraint of keeping everything in memory between statements.

The clang -O0 output happens to be less horrible for this:

    # clang -O0
    # ... after making a stack frame and spilling `p` from RDI to -8(%rbp)
        movq    -8(%rbp), %rdi    # reload p
        movl    (%rdi), %eax      # eax = *p
        addl    $3, %eax          # eax += 3
        movl    %eax, (%rdi)      # *p = eax

See also How to remove "noise" from GCC/clang assembly output? for more about writing functions that compile to interesting asm without optimizing away.

If I compiled with -m32 -mtune=pentium, gcc -O3 would avoid the memory-destination add:

The P5 Pentium microarchitecture (from 1993) does not decode to RISC-like internal uops. Complex instructions take longer to run, and gum up its in-order dual-issue superscalar pipeline.
So GCC avoids them, using a more RISCy subset of x86 instructions that P5 can pipeline better.

    # gcc8.2 -O3 -m32 -mtune=pentium
    add_imm_to_mem(int*):
        movl    4(%esp), %eax    # load p from the stack, because of the 32-bit calling convention
        movl    (%eax), %edx     # *p += 3 implemented as 3 separate instructions
        addl    $3, %edx
        movl    %edx, (%eax)
        ret

You can try this yourself on the Godbolt link above; that's where this is from. Just change the compiler to gcc in the drop-down and change the options.

Not sure it's actually much of a win here, because the instructions are back-to-back. For it to be a real win, gcc would have to interleave some independent instructions. According to Agner Fog's instruction tables, add $imm, (mem) on in-order P5 takes 3 clock cycles, but is pairable in either the U or V pipe. It's been a while since I read through the P5 Pentium section of his microarch guide, but the in-order pipeline definitely has to start each instruction in program order. (Slow instructions, including stores, can complete later, after other instructions have started. But here the add and store depend on the previous instruction, so they definitely have to wait.)

In case you're confused, Intel still uses the Pentium and Celeron brand names for low-end modern CPUs like Skylake. That's not what we're talking about here. We're talking about the original Pentium microarchitecture, which modern Pentium-branded CPUs are not even related to.

GCC refuses -mtune=pentium without -m32, because there are no 64-bit Pentium CPUs. First-gen Xeon Phi uses the Knight's Corner uarch, based on in-order P5 Pentium with vector extensions similar to AVX512 added. But gcc doesn't seem to support -mtune=knc. Clang does, but chooses to use memory-destination add here for that and for -m32 -mtune=pentium.

The LLVM project didn't start until after P5 was obsolete (other than KNC), while gcc was actively developed and tweaked while P5 was in widespread use for x86 desktops. So it's not surprising that gcc still knows some P5 tuning stuff, while LLVM doesn't really treat it differently from modern x86 CPUs, which decode memory-destination instructions to multiple uops and can execute them out of order.