Problem description
I believe that in CPU design, branch prediction causes a major slowdown whenever the wrong branch is chosen. So why do CPU designers pick one branch instead of simply executing both branches, then cutting one off once they know for sure which one was taken?
I realize that this could only go 2 or 3 branches deep within a short run of instructions, or the number of parallel stages would get ridiculously large. At some point you would still need some branch prediction, since you will definitely run across larger branches. But wouldn't a couple of stages like this make sense? It seems to me that it would significantly speed things up and be worth a little added complexity.
Even going just a single branch deep would almost halve the time eaten up by wrong branches, right?
Or maybe it is already done somewhat like this? Branches usually only choose between two options once you get down to assembly, correct?
Recommended answer
You're right to be afraid of exponentially filling the machine, but you underestimate how quickly that happens.

A common rule of thumb says you can expect about 20% branches on average in your dynamic code. This means one branch in every 5 instructions. Most CPUs today have a deep out-of-order core that fetches and executes hundreds of instructions ahead. Take Intel's Haswell, for example: it has a 192-entry ROB, meaning you can hold at most 4 levels of branches. At that point you'll have 16 "fronts" and 31 "blocks", each including a single bifurcating branch. Assuming each block has 5 instructions, you've almost filled your ROB, and another level would exceed it. By then you would have progressed only to an effective depth of ~20 instructions, rendering any instruction-level parallelism useless.
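The ROB budget above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument, not a model of any real pipeline: the function names are made up for illustration, and it assumes every block is exactly 5 instructions and that all paths of a full binary tree stay in flight at once.

```python
ROB_ENTRIES = 192       # Haswell ROB size quoted in the answer
INSTRS_PER_BLOCK = 5    # ~20% branches => roughly one branch per 5 instructions

def blocks_in_flight(levels: int) -> int:
    """Blocks in a full binary tree after forking on `levels` branches."""
    return 2 ** (levels + 1) - 1

def rob_entries_needed(levels: int) -> int:
    """ROB entries consumed if every block holds INSTRS_PER_BLOCK instructions."""
    return blocks_in_flight(levels) * INSTRS_PER_BLOCK

def max_fork_levels(rob_entries: int = ROB_ENTRIES) -> int:
    """Deepest fork level whose whole tree still fits in the ROB."""
    levels = 0
    while rob_entries_needed(levels + 1) <= rob_entries:
        levels += 1
    return levels

if __name__ == "__main__":
    for levels in range(6):
        fronts = 2 ** levels  # number of leaf paths being executed
        print(f"levels={levels} fronts={fronts} "
              f"blocks={blocks_in_flight(levels)} "
              f"entries={rob_entries_needed(levels)}")
    print("deepest level that fits:", max_fork_levels())
```

Under these assumptions, 4 levels need 31 blocks and 155 entries, while a fifth level would need 315 entries, well past the 192 available.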
If you want to diverge on 3 levels of branches, you're going to have 8 parallel contexts, each with only 24 entries available to run ahead. And even that holds only if you ignore the overhead of rolling back 7/8 of your work, the need to duplicate all state-saving hardware (such as registers, which you have dozens of), and the need to split other resources into 8 parts like you did with the ROB. That's also not counting memory management, which would have to handle complicated versioning, forwarding, coherency, etc.
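The per-context numbers follow directly from dividing the ROB among the leaf paths. A minimal sketch, assuming an even split of a 192-entry ROB and treating all but one path as discarded work (the helper names are hypothetical):

```python
ROB_ENTRIES = 192  # ROB size used in the answer's example

def per_context_entries(levels: int, rob_entries: int = ROB_ENTRIES):
    """Return (contexts, ROB entries per context) for an even split."""
    contexts = 2 ** levels
    return contexts, rob_entries // contexts

def discarded_fraction(levels: int) -> float:
    """Fraction of paths thrown away once every branch has resolved."""
    contexts = 2 ** levels
    return (contexts - 1) / contexts

if __name__ == "__main__":
    contexts, entries = per_context_entries(3)
    print(contexts, entries, discarded_fraction(3))
```

For 3 levels this gives 8 contexts, 24 entries each, and 7/8 of the paths discarded, matching the figures in the text.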
Forget about power consumption: even if you could support that wasteful parallelism, spreading your resources that thin would choke you before you could advance more than a few instructions on each path.
Now, let's examine the more reasonable option of splitting over a single branch. This begins to look like Hyperthreading: you split or share your core resources across 2 contexts. This feature has some performance benefits, granted, but only because both contexts are non-speculative. As it is, I believe the common estimate is a gain of around 10-30% over running the 2 contexts one after the other, depending on the workload combination (numbers from a review by AnandTech). That's nice if you really intended to run both tasks one after the other, but not when you're about to throw away the results of one of them. Even if you ignore the mode-switch overhead here, you're gaining 30% only to lose 50%, which makes no sense.
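The "gain 30% only to lose 50%" point can be made concrete with a rough calculation. This sketch rests on a simplifying assumption that is not in the original: the two contexts share the combined throughput equally, so discarding one path discards half of it. The function name is hypothetical.

```python
import math

def useful_throughput(smt_gain: float) -> float:
    """Useful throughput relative to one context owning the whole core.

    smt_gain is the combined speedup from running two contexts at once
    (0.30 means 1.3x). Only one of the two paths is kept, so under the
    equal-share assumption half of the combined work is thrown away.
    """
    combined = 1.0 + smt_gain
    return combined / 2.0

if __name__ == "__main__":
    for gain in (0.10, 0.30):
        print(f"gain={gain:.0%} useful={useful_throughput(gain):.2f}")
    # Even the optimistic 30% case ends up below single-path speed.
    assert math.isclose(useful_throughput(0.30), 0.65)
```

Under that assumption, even the best quoted case leaves the surviving path at roughly 65% of what it would get running alone.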
On the other hand, you have the option of predicting the branches (modern predictors can reach over 95% success rate on average) and paying the misprediction penalty, which is already partially hidden by the out-of-order engine (some instructions predating the branch may execute after it is cleared; most OOO machines support that). This leaves any deep out-of-order engine free to roam ahead, speculating up to its full potential depth and being right most of the time. The odds that the speculated work survives do decrease geometrically (95% after the first branch, ~90% after the second, etc.), but the flush penalty also decreases. It's still far better than a global efficiency of 1/n (for n levels of bifurcation).
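A quick comparison shows how slowly the prediction odds decay. This is a sketch under two assumptions of mine: each branch is predicted independently with the 95% accuracy quoted above, and executing both sides keeps 1 of 2^n paths, which is in line with the 7/8 rollback figure for 3 levels and is more pessimistic than the 1/n mentioned in the text. The function names are hypothetical.

```python
def predicted_path_survival(levels: int, accuracy: float = 0.95) -> float:
    """Chance that work past `levels` predicted branches is never flushed."""
    return accuracy ** levels

def eager_useful_fraction(levels: int) -> float:
    """Share of paths kept when both sides of every branch are executed."""
    return 1 / 2 ** levels

if __name__ == "__main__":
    for levels in range(1, 6):
        print(f"levels={levels} "
              f"predict={predicted_path_survival(levels):.3f} "
              f"eager={eager_useful_fraction(levels):.3f}")
```

After two branches the predicted path still survives about 90% of the time, while executing both sides keeps only a quarter of the paths.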