Problem description
I read the following statement in Patterson & Hennessy's Computer Organization and Design textbook:
I can understand why "issuing multiple instructions per clock cycle" can make a single delay slot insufficient, but I don't know why "longer pipelines" cause it.
Also, I do not understand why longer pipelines cause the branch delay to become longer. Even with a longer pipeline (more steps to finish one instruction), there's no guarantee that the cycle will increase, so why would the branch delay increase?
If you add any stages before the stage that detects branches (and evaluates taken/not-taken for conditional branches), 1 delay slot no longer hides the "latency" between the branch entering the first stage of the pipeline and the correct program-counter address after the branch being known.
The first fetch stage needs info from later in the pipeline to know what to fetch next, because it doesn't itself detect branches. For example, superscalar CPUs with branch prediction need to predict which block of instructions to fetch next, separately from and earlier than predicting which way a branch goes after it's already decoded.
1 delay slot is only sufficient in MIPS I because branch conditions are evaluated in the first half of a clock cycle in EX, in time to forward to the 2nd half of IF, which doesn't need a fetch address until then. (Original MIPS is a classic 5-stage RISC: IF ID EX MEM WB.) See Wikipedia's article on the classic RISC pipeline for many more details, specifically the control hazards section.
That's why MIPS is limited to simple conditions like beq (find any mismatches from an XOR), or bltz (sign bit check). It cannot do anything that requires an adder for carry propagation (so a general blt between two registers is only a pseudo-instruction).
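To make the difference concrete, here is a rough sketch in Python of what each condition has to compute. The function names are made up for illustration, and Python's integer compare stands in for the adder that real hardware would need; it only models the logic, not the timing.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers with Python ints


def beq_taken(rs: int, rt: int) -> bool:
    """beq: taken when the XOR of the two registers has no bit set."""
    return ((rs ^ rt) & MASK32) == 0


def bltz_taken(rs: int) -> bool:
    """bltz: taken when the sign bit (bit 31) is set. No arithmetic needed."""
    return ((rs & MASK32) >> 31) == 1


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def slt(rs: int, rt: int) -> int:
    """slt: signed less-than. In hardware this needs a subtract, so the
    result depends on carry propagation across all 32 bits."""
    return 1 if to_signed(rs) < to_signed(rt) else 0


def blt_taken(rs: int, rt: int) -> bool:
    """blt pseudo-instruction, as assemblers expand it:
    slt $at, rs, rt ; bne $at, $zero, target."""
    at = slt(rs, rt)
    return not beq_taken(at, 0)  # bne is just the inverse of the XOR check
```

The point is that beq and bltz only need a bitwise check, while the compare for blt is done by a separate slt instruction that gets a normal EX stage, leaving the branch itself with a cheap test against zero.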
This is very restrictive: a longer front-end can absorb the latency from a larger/more associative L1 instruction cache that takes more than half a cycle to respond on a hit. (MIPS I decode is very simple, though, with the instruction format intentionally designed so machine-code bits can be wired directly as internal control signals. So you can maybe make decode the "half cycle" stage, with fetch getting 1 full cycle, but even 1 full cycle isn't much time for fetch once cycle times shrink at higher clock speeds.)
Raising the clock speed might require adding another fetch stage. Decode does have to detect data hazards and set up bypass forwarding; original MIPS kept that simpler by not detecting load-use hazards. Instead, software had to respect a load-delay slot until MIPS II. A superscalar CPU has many more possible hazards, even with 1-cycle ALU latency, so detecting what has to forward to what requires more complex logic for matching destination registers in old instructions against sources in younger instructions.
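As a rough illustration of that matching logic, here is a small Python sketch. The Instr type, find_forwards, and comparator_count are hypothetical names, and the comparator count is a simplified model that assumes 2 source registers per instruction and ignores dependencies between instructions decoded in the same cycle.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[int]     # destination register number, or None
    srcs: Tuple[int, ...]   # source register numbers


def find_forwards(older: List[Instr], younger: List[Instr]):
    """For each source of each younger instruction, find the most recent
    older in-flight instruction that writes that register.
    `older` is ordered oldest-first."""
    forwards = []
    for consumer in younger:
        for reg in consumer.srcs:
            if reg == 0:            # $zero is hardwired, never forwarded
                continue
            for producer in reversed(older):   # check the youngest producer first
                if producer.dest == reg:
                    forwards.append((producer.name, consumer.name, reg))
                    break
    return forwards


def comparator_count(in_flight_producers: int, width: int,
                     srcs_per_instr: int = 2) -> int:
    """Register-number comparisons needed per cycle in this simple model."""
    return in_flight_producers * width * srcs_per_instr
```

With these assumptions, a scalar 5-stage pipeline with 2 in-flight producers gives comparator_count(2, 1) == 4, while a 2-wide pipeline of the same depth has 4 producers and 2 consumers, giving comparator_count(4, 2) == 16. That growth is one reason decode may need more time or an extra stage.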
A superscalar pipeline might even want some buffering in instruction fetch to avoid bubbles. A multi-ported register file might be slightly slower to read, maybe requiring an extra decode pipeline stage, although that can probably still be done in 1 cycle.
So, as well as making 1 branch delay slot insufficient by the very nature of superscalar execution, a longer pipeline also increases branch latency, if the extra stages are between fetch and branch resolution. For example, an extra fetch stage and a 2-wide pipeline could have 4 instructions in flight after a branch instead of 1.
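That arithmetic can be written down as a back-of-the-envelope model. This is only a sketch: the function name and parameters are made up for illustration, and it ignores where the branch sits within its fetch group.

```python
def instructions_after_branch(issue_width: int, branch_latency_cycles: int) -> int:
    """Instructions fetched after a branch before the correct next PC is known.

    issue_width: instructions fetched per clock cycle.
    branch_latency_cycles: cycles between fetching the branch and fetch
        being able to use the resolved address (1 in MIPS I).
    """
    return issue_width * branch_latency_cycles


# MIPS I: scalar, 1 cycle of branch latency, so 1 delay slot covers it.
mips1 = instructions_after_branch(issue_width=1, branch_latency_cycles=1)

# One extra fetch stage (2 cycles of latency) on a 2-wide pipeline.
wider_and_deeper = instructions_after_branch(issue_width=2, branch_latency_cycles=2)

print(mips1, wider_and_deeper)
```

Either factor alone makes a single delay slot too few, and they multiply: 1 instruction for the MIPS I case versus 4 for the wider, deeper one.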
But instead of introducing more branch delay slots to hide this branch delay, the actual solution is branch prediction. (However, some DSPs or high-performance microcontrollers do have 2 or even 3 branch delay slots.)
Branch-delay slots complicate exception handling; you need a fault-return and a next-after-that address, in case the fault was in a delay slot of a taken branch.
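A minimal sketch of why two addresses are needed, assuming 4-byte instructions and a hypothetical resume_addresses helper. Real ISAs record this state in different ways (for instance, by saving the branch's address plus a flag bit and re-running the branch), so treat this as the idea only.

```python
from typing import Optional, Tuple

INSTR_BYTES = 4  # assumption: fixed 4-byte instructions


def resume_addresses(fault_pc: int,
                     in_delay_slot_of_taken_branch: bool,
                     branch_target: Optional[int] = None) -> Tuple[int, int]:
    """Return (address to re-run, address to run after that)."""
    if in_delay_slot_of_taken_branch:
        if branch_target is None:
            raise ValueError("a taken branch needs its target address")
        # After the delay-slot instruction, execution continues at the
        # branch target, not at the next sequential address.
        return fault_pc, branch_target
    return fault_pc, fault_pc + INSTR_BYTES
```

Without delay slots, the second address is always implied by the first, so a single saved PC is enough.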