Question
I was just going over an answer by Peter Cordes, in which he says,
I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?
Recommended answer
Generally speaking, a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.
So an instruction like inc, which sets only some flags (it doesn't set CF), doesn't inherently cause a partial stall, but it will cause a stall if a subsequent instruction reads the flag (CF) that inc did not set, with no intervening instruction that sets CF. This also implies that instructions that write all of the interesting flags are never involved in partial stalls: when such an instruction is the most recent flag-setting instruction at the point a flag-reading instruction executes, it must have written the consumed flag.
So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family and cmovcc, plus a few specialized instructions like adc), then walk backwards to find the first instruction that sets any flag, and check whether it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
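The backward walk described above can be sketched in a few lines of Python. This is only a rough model under stated assumptions: the WRITES and READS tables are hand-written for the handful of instructions used in this answer, the function name partial_flag_stalls is made up for illustration, and only straight-line code (no branches taken) is considered.

```python
# Simplified, hand-written flag tables (an assumption of this sketch, not a full ISA model).
ALL = frozenset({"CF", "ZF", "SF", "OF", "PF", "AF"})
WRITES = {
    "add": ALL,
    "sub": ALL,
    "cmp": ALL,
    "adc": ALL,
    "inc": ALL - {"CF"},  # inc/dec leave CF untouched
    "dec": ALL - {"CF"},
}
READS = {
    "jc": {"CF"},
    "jnz": {"ZF"},
    "ja": {"CF", "ZF"},
    "jbe": {"CF", "ZF"},
    "adc": {"CF"},
}


def partial_flag_stalls(mnemonics):
    """Return indices of flag readers predicted to stall (straight-line code only)."""
    stalls = []
    for i, op in enumerate(mnemonics):
        needed = READS.get(op)
        if not needed:
            continue
        # Walk backwards to the most recent instruction that writes any flag.
        for prev in reversed(mnemonics[:i]):
            written = WRITES.get(prev)
            if written:
                # Stall if that instruction did not write every flag we read.
                if not needed <= written:
                    stalls.append(i)
                break
    return stalls


print(partial_flag_stalls(["add", "inc", "ja"]))
print(partial_flag_stalls(["inc", "add", "ja"]))
```

The returned indices point at the consuming instruction, since that is where the stall would be observed according to the rule above.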
Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but in some cases they still pay a penalty in the form of an additional uop added to the front-end by the instruction. The rules are slightly different and apply to a narrower set of cases than the stall discussed above. In particular, the so-called flag-merging uop is added only when a flag-consuming instruction reads multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
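As a rough illustration of the merge-uop rule just stated, the sketch below tracks which instruction last wrote each flag and reports a consumer whose flags come from more than one producer. The tables and the name merge_uop_sites are hypothetical, chosen for this example only, and the model follows the rule as worded in this answer rather than any documented hardware detail.

```python
# Hand-written flag tables for the instructions used in this answer (an assumption).
ALL = frozenset({"CF", "ZF", "SF", "OF", "PF", "AF"})
WRITES = {"add": ALL, "inc": ALL - {"CF"}, "dec": ALL - {"CF"}}
READS = {"jc": {"CF"}, "jnz": {"ZF"}, "ja": {"CF", "ZF"}}


def merge_uop_sites(mnemonics):
    """Return indices of consumers whose flags were last set by different instructions."""
    last_writer = {}  # flag name -> index of the instruction that last wrote it
    sites = []
    for i, op in enumerate(mnemonics):
        needed = READS.get(op, ())
        # A flag with no writer in the sequence maps to None (unknown earlier producer).
        producers = {last_writer.get(flag) for flag in needed}
        if len(producers) > 1:
            sites.append(i)
        for flag in WRITES.get(op, ()):
            last_writer[flag] = i
    return sites


print(merge_uop_sites(["add", "inc", "ja"]))
print(merge_uop_sites(["add", "inc", "jc"]))
```

A consumer that reads a single flag always has exactly one producer in this model, which matches the remark that single-flag readers never need a merging uop.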
Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO group flags can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition be requires both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of the C and SPAZO flags, however, and take one uop.
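To see which conditions fall into the "needs both C and SPAZO" category, the following sketch lists the flags tested by each x86 condition code and filters for those that read CF together with at least one SPAZO flag. The table reflects the standard condition-code definitions as I understand them; the helper name reads_both_groups is made up for this example, and the result says nothing by itself about measured uop counts.

```python
# Flags tested by each condition code (one spelling per condition; aliases omitted).
SPAZO = {"SF", "PF", "AF", "ZF", "OF"}
CONDITION_FLAGS = {
    "o": {"OF"}, "no": {"OF"},
    "b": {"CF"}, "ae": {"CF"},
    "e": {"ZF"}, "ne": {"ZF"},
    "be": {"CF", "ZF"}, "a": {"CF", "ZF"},
    "s": {"SF"}, "ns": {"SF"},
    "p": {"PF"}, "np": {"PF"},
    "l": {"SF", "OF"}, "ge": {"SF", "OF"},
    "le": {"ZF", "SF", "OF"}, "g": {"ZF", "SF", "OF"},
}


def reads_both_groups(cc):
    """True if the condition needs CF and at least one flag from the SPAZO group."""
    flags = CONDITION_FLAGS[cc]
    return "CF" in flags and bool(flags & SPAZO)


both = sorted(cc for cc in CONDITION_FLAGS if reads_both_groups(cc))
print(both)  # ['a', 'be']
```

Only the be and a conditions read flags from both groups, so by the same reasoning cmova would presumably sit in the same category as cmovbe, although only cmovbe is named above.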
Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as noted above at most one of the two applies to any given architecture, so something like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".
The following example will cause a stall and a merging uop to be emitted on Sandy Bridge and Ivy Bridge, but not on Skylake:
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
ja label ; reads CF and ZF
The ja instruction reads CF and ZF, which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads CF, which was not set by the most recent flag-setting instruction.
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
jc label ; reads CF
This causes a stall because, as in the prior example, CF is read and it was not set by the last flag-setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add, since they are independent; the jc would then read only from the most recent flag-setting operation. No merge uop is needed because the flags read (only CF) all come from the same add instruction.
Note: This case is under debate (see the comments), but I cannot test it because I don't find evidence of any merging ops at all on my Skylake.
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
jnz label ; reads ZF
Here no stall or merging uop is needed, even though the last instruction (inc) sets only some flags, because the consuming jnz reads only (a subset of) the flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.
Here's another example that doesn't cause any stall or merge uop:
inc rax ; sets ZF, but not CF
add rbx, 5 ; sets CF, ZF, others
ja label ; reads CF and ZF
Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial flag writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.
The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling. For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures, variable shifts have a significant cost of 3 uops due to flag handling (but there is no more "stall").
I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.
Some rotate instructions also have interesting flag-related behavior, in some cases similar to shifts.
For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.
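The count-dependent behavior can be made concrete with a small table-like function. This is a sketch of the architectural rules for shl, shr and sar on a 64-bit operand as I understand them from the instruction set documentation (count masked to 6 bits, nothing written for a zero count, OF defined only for a count of 1, AF undefined for any non-zero count); the function name shift_flag_effects is hypothetical, and it describes architectural results only, not the microarchitectural cost discussed above.

```python
def shift_flag_effects(count):
    """Flags written vs. left undefined by shl/shr/sar on a 64-bit operand (assumed rules)."""
    n = count & 63  # the count is masked to 6 bits for 64-bit operands
    if n == 0:
        # A zero count leaves every flag unchanged.
        return {"written": set(), "undefined": set()}
    written = {"CF", "SF", "ZF", "PF"}
    undefined = {"AF"}
    if n == 1:
        written.add("OF")      # OF is only defined for 1-bit shifts
    else:
        undefined.add("OF")
    return {"written": written, "undefined": undefined}


for c in (0, 1, 5, 64):
    print(c, shift_flag_effects(c))
```

Because the set of flags produced depends on a count that may only be known at run time, a variable-count shift cannot simply be treated as "writes all flags", which fits the remark that these instructions need extra flag handling.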