At most, we get: [...] and [...]

But exactly how does GCC favor one specific architecture when building, while still producing a build that can run on other (usually older) architectures, albeit slower?

I only know of one thing (but I'm no computer scientist) that would be capable of this, and that's a CPU dispatcher. However, it doesn't seem (to me) that -mtune is generating a dispatcher behind the scenes; some other mechanism is probably in effect.

I feel that way for two reasons:

1. Searching "gcc mtune cpu dispatcher" doesn't find anything relevant; and
2. If it were based on a dispatcher, I think it could be smarter (even if through some option other than -mtune) and use cpuid to detect supported instructions at runtime, instead of relying on a named architecture provided at build time.

So how does it really work?

Solution

-mtune doesn't create a dispatcher, and it doesn't need one: we are already telling the compiler which architecture we are targeting.

From the GCC docs: [...]

This means that GCC won't use instructions available only on cpu-type [1], but it will generate code that runs optimally on cpu-type.

To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture. The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune. The micro-architecture is how the architecture is implemented in hardware. For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.

When generating machine code, GCC often has a degree of freedom in choosing how to order the instructions and which variant to use. It will use a heuristic to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that will
penalise CPUs y, z and w.

When we use -mtune=x, we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.

As a concrete example, consider how this code is compiled:

    float bar(float a[4], float b[4])
    {
        for (int i = 0; i < 4; i++)
        {
            a[i] += b[i];
        }

        float r = 0;

        for (int i = 0; i < 4; i++)
        {
            r += a[i];
        }

        return r;
    }

The a[i] += b[i]; statement is vectorised (if the vectors don't overlap) differently when targeting Skylake or Core2:

Skylake

    movups  xmm0, XMMWORD PTR [rsi]
    movups  xmm2, XMMWORD PTR [rdi]
    addps   xmm0, xmm2
    movups  XMMWORD PTR [rdi], xmm0
    movss   xmm0, DWORD PTR [rdi]

Core2

    pxor    xmm0, xmm0
    pxor    xmm1, xmm1
    movlps  xmm0, QWORD PTR [rdi]
    movlps  xmm1, QWORD PTR [rsi]
    movhps  xmm1, QWORD PTR [rsi+8]
    movhps  xmm0, QWORD PTR [rdi+8]
    addps   xmm0, xmm1
    movlps  QWORD PTR [rdi], xmm0
    movhps  QWORD PTR [rdi+8], xmm0
    movss   xmm0, DWORD PTR [rdi]

The main difference is how an xmm register is loaded: on a Core2 it is loaded with two loads, using movlps and movhps, instead of a single movups.

The two-load approach is better on the Core2 micro-architecture: if you take a look at Agner Fog's instruction tables, you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with 1 cycle of latency. This is probably because 128-bit accesses were split into two 64-bit accesses at the time.

On Skylake the opposite is true: a single movups performs better than two movXps.

So we have to pick one. In general, GCC picks the first variant (the single movups), because Core2 is an old micro-architecture, but we can override this with -mtune.

[1] The instruction set is selected with other switches.
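If you want to reproduce the Skylake/Core2 comparison yourself, here is a minimal sketch. Three things in it are my own assumptions and not part of the example above: the file name bar.c, the restrict qualifiers (they only promise the compiler that the two arrays don't overlap, which is the condition mentioned for vectorisation), and the hypothetical check_bar helper. The exact instructions you get will depend on your GCC version, so treat the listings above as indicative.

```c
/* bar.c (assumed file name)
 *
 * Compare the generated assembly for two tuning targets (x86-64 only):
 *
 *   gcc -O3 -mtune=core2   -masm=intel -S -o bar_core2.s   bar.c
 *   gcc -O3 -mtune=skylake -masm=intel -S -o bar_skylake.s bar.c
 */
#include <assert.h>

/* Same computation as the example in the answer. The restrict qualifiers
 * are an addition of this sketch: they state that a and b do not overlap. */
float bar(float *restrict a, const float *restrict b)
{
    /* Element-wise add: this is the loop GCC may vectorise. */
    for (int i = 0; i < 4; i++)
        a[i] += b[i];

    /* Sum the updated elements of a. */
    float r = 0.0f;
    for (int i = 0; i < 4; i++)
        r += a[i];

    return r;
}

/* Hypothetical self-check: tuning changes which instructions are chosen,
 * not what the function computes, so this must hold for any -mtune value. */
void check_bar(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {4.0f, 3.0f, 2.0f, 1.0f};

    assert(bar(a, b) == 20.0f); /* every a[i] becomes 5.0f, and 4 * 5 = 20 */
}
```

Since -mtune leaves the instruction set alone, both outputs should run on any x86-64 CPU; only the speed on a given micro-architecture is expected to differ.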
