How does mtune work?

There's this related question: GCC: how is march different from mtune? However, the
existing answers don't go much further than the GCC manual itself. At most, we get:

> If you use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated.

But exactly how does GCC favor one specific architecture, when building, while still being capable of running the build on other (usually older) architectures, albeit slower?

I only know of one thing (but I'm no computer scientist) which would be capable of such, and that's a CPU dispatcher. However, it doesn't seem (to me) that mtune generates a dispatcher behind the scenes; instead, some other mechanism is probably in effect.

I feel that way for two reasons:

- Searching "gcc mtune cpu dispatcher" doesn't find anything relevant; and
- If it were based on a dispatcher, I think it could be smarter (even if through some option other than mtune) and test for cpuid to detect the supported instructions at runtime, instead of relying on a named architecture provided at build time.

So how does it really work?

Solution

-mtune doesn't create a dispatcher, and it doesn't need one: we are already telling the compiler what architecture we are targeting.

From the GCC docs: GCC won't use instructions available only on cpu-type (1), but it will generate code that runs optimally on cpu-type.

To understand this last statement, it is necessary to understand the difference between an architecture and a micro-architecture.

The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
The micro-architecture is how that architecture is implemented in hardware.

For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.

When generating machine code, GCC often has a degree of freedom in choosing how to order the instructions and which variants to use. It uses a heuristic to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it
will sacrifice a 100% optimal solution for CPU x if that would penalise CPUs y, z and w.

When we use -mtune=x, we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.

As a concrete example, consider how this code is compiled:

```c
float bar(float a[4], float b[4])
{
    for (int i = 0; i < 4; i++) {
        a[i] += b[i];
    }

    float r = 0;
    for (int i = 0; i < 4; i++) {
        r += a[i];
    }
    return r;
}
```

The a[i] += b[i]; loop is vectorised (if the vectors don't overlap) differently when targeting a Skylake or a Core2:

Skylake

```asm
movups  xmm0, XMMWORD PTR [rsi]
movups  xmm2, XMMWORD PTR [rdi]
addps   xmm0, xmm2
movups  XMMWORD PTR [rdi], xmm0
movss   xmm0, DWORD PTR [rdi]
```

Core2

```asm
pxor    xmm0, xmm0
pxor    xmm1, xmm1
movlps  xmm0, QWORD PTR [rdi]
movlps  xmm1, QWORD PTR [rsi]
movhps  xmm1, QWORD PTR [rsi+8]
movhps  xmm0, QWORD PTR [rdi+8]
addps   xmm0, xmm1
movlps  QWORD PTR [rdi], xmm0
movhps  QWORD PTR [rdi+8], xmm0
movss   xmm0, DWORD PTR [rdi]
```

The main difference is how an xmm register is loaded: on a Core2 it is loaded with two loads, using movlps and movhps, instead of a single movups.

The two-load approach is better on the Core2 micro-architecture; if you take a look at Agner Fog's instruction tables, you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with 1 cycle of latency. This is probably due to the fact that 128-bit accesses were split into two 64-bit accesses at the time. On Skylake the opposite is true: movups performs better than the two movXps.

So we have to pick one. In general, GCC picks the first variant, because Core2 is an old micro-architecture, but we can override this with -mtune.

(1) The instruction set is selected with other switches.