Problem description
Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.
It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor, namely instruction reordering, cache preloading, dependency interleaving, etc.
The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40-cycle penalty every time you called a function through a pointer (!), and this was apparently normal.
This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.
The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.
I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:
Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc.?
Huh.. so large..
There is a technique called "indirect branch prediction", which helps predict the target of a virtual function call IF the same indirect jump was executed some time earlier. There is still a penalty for the first execution of a virtual function call and for every mispredicted one.
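To make concrete what kind of call the predictor has to handle, here is a minimal C++ sketch. The `Shape`, `Triangle` and `Square` types and the `total_sides` function are hypothetical names invented for this illustration. On typical implementations the call `s->sides()` becomes a load of the object's vtable pointer, a load of the function address, and an indirect call, unless the compiler manages to devirtualize it.

```cpp
#include <vector>

// Hypothetical interface, used only for illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

struct Square : Shape {
    int sides() const override { return 4; }
};

// The call site inside the loop is a single indirect branch whose target
// depends on the dynamic type of each element. If the element types are
// mixed, that one branch has several possible targets to predict.
int total_sides(const std::vector<const Shape*>& shapes) {
    int total = 0;
    for (const Shape* s : shapes) {
        total += s->sides();  // virtual call: target known only at run time
    }
    return total;
}
```

A vector holding only `Triangle` objects gives the predictor one stable target; a vector mixing `Triangle` and `Square` makes the same call site switch between two targets.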
Support varies from a simple scheme ("predicted right if and only if the previous target of the indirect branch was exactly the same") to very complex two-level predictors with tens or hundreds of entries, which can detect a periodic alternation of 2-3 target addresses for a single indirect jmp instruction.
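The difference between the two ends of that range can be sketched with a toy simulation. This is not a model of any real CPU: the function names are made up for this example, targets are represented as plain integers, and the history table is an unbounded `std::map`, whereas real hardware uses small, hashed, fixed-size tables.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Toy "last target" predictor: always guesses that the branch goes
// where it went the previous time. Returns the number of mispredictions.
std::size_t mispredicts_last_target(const std::vector<int>& targets) {
    std::size_t misses = 0;
    bool has_prediction = false;
    int predicted = 0;
    for (int actual : targets) {
        if (!has_prediction || predicted != actual) ++misses;
        predicted = actual;  // remember only the most recent target
        has_prediction = true;
    }
    return misses;
}

// Toy history-based predictor: the guess is looked up by the last
// `history_len` targets taken by this same branch, so a repeating
// pattern becomes predictable once each history has been seen.
std::size_t mispredicts_with_history(const std::vector<int>& targets,
                                     std::size_t history_len) {
    std::map<std::vector<int>, int> table;  // recent targets -> next target
    std::vector<int> history;
    std::size_t misses = 0;
    for (int actual : targets) {
        auto it = table.find(history);
        if (it == table.end() || it->second != actual) ++misses;
        table[history] = actual;  // train on the real outcome
        history.push_back(actual);
        if (history.size() > history_len) history.erase(history.begin());
    }
    return misses;
}
```

For a branch that strictly alternates between two targets, the last-target scheme misses on every execution, while the history-based scheme with a history length of 2 misses only during its first four executions and then predicts correctly.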
There was a lot of evolution here...
http://arstechnica.com/hardware/news/2006/04/core.ars/7
http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3
http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5
http://www.agner.org/optimize/microarchitecture.pdf
and the same pdf, page 14
Agner's manual has a longer description of the branch predictors in many modern CPUs and of how the predictor evolved in each manufacturer's CPUs (x86/x86_64).
There are also many theoretical "indirect branch prediction" methods (search Google Scholar); even Wikipedia says a few words about it: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps
For Atom, from Agner's microarchitecture manual:
So, for low-power CPUs, indirect branch prediction is not so advanced. The same goes for the Via Nano:
I think the shorter pipelines of low-power x86 CPUs have a lower penalty, 7-20 ticks.