I'm wondering how loops of various sizes perform on recent x86 processors, as a function of the number of uops.

Here's a quote from Peter Cordes, who raised the issue of non-multiple-of-4 counts in another question:

The issue is about whether loops need to be a multiple of N uops to execute at maximum uop throughput, where N is the width of the processor (i.e., 4 for recent Intel processors). There are a lot of complicating factors when talking about "width" and counting uops, but I mostly want to ignore those. In particular, assume no micro or macro-fusion.

Peter gives an example of a loop with 7 uops in its body. More generally, the claim is that each iteration of a loop with x uops in its body will take at least ceil(x / 4) cycles, rather than simply x / 4.

Is this true for some or all recent x86-compatible processors?

Solution

I did some investigation with Linux perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell. Other architectures may vary, and to help sort it all out I welcome additional results. The source is available.

This question mostly deals with the front end, since on recent architectures it is the front end which imposes the hard limit of four fused-domain uops per cycle.

Summary of Rules for Loop Performance

First, I'll summarize the results in terms of a few "performance rules" to keep in mind when dealing with small loops. There are plenty of other performance rules as well - these are complementary to them (i.e., you probably don't break another rule just to satisfy these). These rules apply most directly to Haswell and later architectures - see the other answer for an overview of the differences on earlier architectures.

First, count the number of macro-fused uops in your loop. You can use Agner's instruction tables to look this up directly for every instruction, except that an ALU uop and an immediately following branch will usually fuse together into a single uop. Then, based on this count (a worked example follows this list):

- If the count is a multiple of 4, you're good: these loops execute optimally.
- If the count is even and less than 32, you're good, except if it's 10, in which case you should unroll to another even number if you can.
- For odd numbers you should try to unroll to an even number less than 32, or a multiple of 4, if you can.
- For loops larger than 32 uops but less than 64, you might want to unroll if the count isn't already a multiple of 4: with more than 64 uops you'll get efficient performance at any value on Skylake and almost all values on Haswell (with a few deviations, possibly alignment related).
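Here is a hypothetical worked example of the counting step (the loop body is invented for illustration; the per-instruction uop counts follow the fusion rules just described):

```nasm
count_example:
        mov  ecx, 1000
        xor  eax, eax
.top:
        add  eax, [rsi]      ; 1 fused-domain uop (micro-fused load+add)
        add  rsi, 4          ; 1 uop
        imul eax, eax, 3     ; 1 uop
        nop                  ; 1 uop
        nop                  ; 1 uop
        dec  ecx             ; dec + jnz macro-fuse, so together
        jnz  .top            ;   they count as 1 uop
; Total: seven instructions but 6 fused-domain uops - even, below 32
; and not 10, so by the rules above this loop should already run
; optimally (N/4 = 1.5 cycles/iteration) without unrolling.
```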
The inefficiencies for these loops are still relatively small: the values to avoid most are 4N + 1 counts, followed by 4N + 2 counts.

Summary of Findings

- For code served out of the uop cache, there are no apparent multiple-of-4 effects. Loops of any number of uops can be executed at a throughput of 4 fused-domain uops per cycle.
- For code processed by the legacy decoders, the opposite is true: loop execution time is limited to an integral number of cycles, and hence loops that are not a multiple of 4 uops cannot achieve 4 uops/cycle, as they waste some issue/execution slots.
- For code issued from the loop stream detector (LSD), the situation is a mix of the two and is explained in more detail below. In general, loops of less than 32 uops with an even number of uops execute optimally, while odd-sized loops do not, and larger loops require a multiple-of-4 uop count to execute optimally.

What Intel Says

Intel actually has a note on this in their optimization manual; details are in the other answer.

Details

As anyone well-versed in recent x86-64 architectures knows, at any point the fetch and decode portion of the front end may be working in one of several different modes, depending on the code size and other factors. As it turns out, these different modes all have different behaviors with respect to loop sizing. I'll cover them separately below.

Legacy Decoder

The legacy decoder is the full machine-code-to-uops decoder that is used when the code doesn't fit in the uop caching mechanisms (LSD or DSB). The primary reason this would occur is if the code working set is larger than the uop cache (~1500 uops in the ideal case, less in practice). For this test though, we'll take advantage of the fact that the legacy decoder will also be used if an aligned 32-byte chunk contains more than 18 instructions.

To test the legacy decoder behavior, we use a loop that looks like this:

```nasm
short_nop:
    mov rax, 100_000_000
ALIGN 32
.top:
    dec rax
    nop
    ...
    jnz .top
    ret
```

Basically, a trivial loop that counts down until rax is zero. All instructions are a single uop and the number of nop instructions is varied (at the location shown as ...) to test different sizes of loops (so a 4-uop loop will have 2 nops, plus the two loop control instructions). There is no macro-fusion, as we always separate the dec and jnz with at least one nop, and also no micro-fusion. Finally, there is no memory access (outside of the implied icache access).

Note that this loop is very dense - about 1 byte per instruction (since the nop instructions are 1 byte each) - so we'll trigger the more-than-18-instructions-in-a-32B-chunk condition as soon as we hit 19 instructions in the loop. Based on examining the perf performance counters lsd.uops and idq.mite_uops, that's exactly what we see: essentially 100% of the instructions come out of the LSD up to and including the 18-uop loop, but at 19 uops and up, 100% come from the legacy decoder.
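For concreteness, here is what one generated instance of the test looks like at 7 uops (my expansion of the template; per the counters just mentioned, a loop this small is still served from the LSD - the legacy decoder only takes over at 19+ instructions):

```nasm
short_nop_7:
    mov rax, 100_000_000
ALIGN 32
.top:
    dec rax              ; 1 uop
    nop                  ; five 1-byte nops = 5 uops
    nop
    nop
    nop
    nop
    jnz .top             ; 1 uop (no macro-fusion: a nop separates dec and jnz)
    ret
```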
In any case, here are the cycles/iteration for all loop sizes from 3 to 99 uops. The blue points are the loops that fit in the LSD, and show somewhat complex behavior; we'll look at these later. The red points (starting at 19 uops) are handled by the legacy decoder, and show a very predictable pattern:

All loops with N uops take exactly ceiling(N/4) cycles

So, for the legacy decoder at least, Peter's observation holds exactly on Skylake: loops with a multiple of 4 uops may execute at an IPC of 4, but any other number of uops will waste 1, 2 or 3 execution slots (for loops with 4N+3, 4N+2, 4N+1 instructions, respectively).

It is not clear to me why this happens. Although it may seem obvious if you consider that decoding happens in contiguous 16B chunks, and so at a decoding rate of 4 uops/cycle a loop that is not a multiple of 4 would always have some trailing (wasted) slots in the cycle in which the jnz instruction is encountered. However, the actual fetch & decode unit is composed of predecode and decode phases, with a queue in between. The predecode phase actually has a throughput of 6 instructions, but only decodes to the end of the 16-byte boundary on each cycle. This seems to imply that the bubble that occurs at the end of the loop could be absorbed by the predecoder -> decode queue, since the predecoder has an average throughput higher than 4.

So I can't fully explain this based on my understanding of how the predecoder works. It may be that there is some additional limitation in decoding or pre-decoding that prevents non-integral cycle counts. For example, perhaps the legacy decoders cannot decode instructions on both sides of a jump, even if the instructions after the jump are available in the predecoded queue. Perhaps it is related to the need to handle macro-fusion.

The test above shows the behavior where the top of the loop is aligned on a 32-byte boundary. Below is the same graph, but with an added series that shows the effect when the top of the loop is moved 2 bytes up (i.e., now misaligned at a 32N + 30 boundary). Most loop sizes now suffer a 1 or 2 cycle penalty. The 1-cycle penalty case makes sense when you consider decoding in 16B chunks at 4 instructions per cycle, and the 2-cycle penalty cases occur for loops where for some reason the DSB is used for 1 instruction in the loop (probably the dec instruction, which appears in its own 32-byte chunk), and some DSB<->MITE switching penalties are incurred.

In some cases, the misalignment doesn't hurt when it ends up better aligning the end of the loop. I tested the misalignment and it persists in the same way up to 200-uop loops. If you take the description of the predecoders at face value, it would seem that, as above, they should be able to hide a fetch bubble for misalignment, but it doesn't happen (perhaps the queue is not big enough).

DSB (Uop Cache)

The uop cache (Intel likes to call it the DSB) is able to cache most loops of a moderate number of instructions. In a typical program, you'd hope that most of your instructions are served out of this cache.

We can repeat the test above, but now serving uops out of the uop cache. This is a simple matter of increasing the size of our nops to 2 bytes, so we no longer hit the 18-instruction limit. We use the 2-byte nop xchg ax, ax in our loop:

```nasm
long_nop_test:
    mov rax, iters
ALIGN 32
.top:
    dec eax
    xchg ax, ax          ; this is a 2-byte nop
    ...
    xchg ax, ax
    jnz .top
    ret
```

Here, the results are very straightforward. For all tested loop sizes delivered out of the DSB, the number of cycles required was N/4 - i.e., the loops executed at the maximum theoretical throughput, even if they didn't have a multiple of 4 uops.
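As a quick worked contrast between the two regimes (my arithmetic), consider a 7-uop loop:

$$\text{DSB: } \frac{7}{4} = 1.75 \text{ cycles/iteration} \qquad \text{legacy decoder: } \left\lceil \frac{7}{4} \right\rceil = 2 \text{ cycles/iteration}$$

The rounding costs about 14% at this size, and the relative penalty shrinks as N grows, which matches the efficiency plots later in this answer.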
So in general, on Skylake, moderately sized loops served out of the DSB shouldn't need to worry about ensuring the uop count meets some particular multiple.

Here's a graph out to 1,000-uop loops. If you squint, you can see the sub-optimal behavior before 64 uops (when the loop is in the LSD). After that, it's a straight shot, 4 IPC the whole way to 1,000 uops (with a blip around 900 that was probably due to load on my box).

Next we look at performance for loops that are small enough to fit in the uop cache.

LSD (Loop Stream Detector)

Important note: Intel has apparently disabled the LSD on Skylake (SKL150 erratum) and Kaby Lake (KBL095, KBW095 erratum) chips via a microcode update, and on Skylake-X out of the box, due to a bug related to the interaction between hyperthreading and the LSD. For those chips, the graph below will likely not have the interesting region up to 64 uops; rather, it will just look the same as the region after 64 uops.

The loop stream detector can cache small loops of up to 64 uops (on Skylake). In Intel's recent documentation it is positioned more as a power-saving mechanism than a performance feature - although there are certainly no performance downsides mentioned to using the LSD.

Running this for the loop sizes that should fit in the LSD, we get the following cycles/iteration behavior. The red line here is the % of uops which are delivered from the LSD. It flatlines at 100% for all loop sizes from 5 to 56 uops.

For the 3 and 4 uop loops, we have the unusual behavior that 16% and 25% of the uops, respectively, are delivered from the legacy decoder. Huh? Luckily, it doesn't seem to affect the loop throughput, as both cases achieve the maximum throughput of 1 loop/cycle - despite the fact that one could expect some MITE<->LSD transition penalties.

Between loop sizes of 57 and 62 uops, the number of uops delivered from the LSD exhibits some weird behavior - approximately 70% of the uops are delivered from the LSD, and the rest from the DSB. Skylake nominally has a 64-uop LSD, so this is some kind of transition right before the LSD size is exceeded - perhaps there is some kind of internal alignment within the IDQ (on which the LSD is implemented) that causes only partial hits to the LSD in this phase. This phase is short and, performance-wise, seems mostly to be a linear combination of the fully-in-LSD performance which precedes it and the fully-in-DSB performance which follows it.

Let's look at the main body of results between 5 and 56 uops. We see three distinct regions:

Loops from 3 to 10 uops: Here, the behavior is complex. It is the only region where we see cycle counts that can't be explained by static behavior over a single loop iteration. The range is short enough that it's hard to say if there is a pattern. Loops of 4, 6 and 8 uops all execute optimally, in N/4 cycles (that's the same pattern as the next region). A loop of 10 uops, on the other hand, executes in 2.66 cycles per iteration, making it the only even loop size that doesn't execute optimally until you get to loop sizes of 34 uops or above (other than the outlier at 26). That corresponds to something like a repeated uop/cycle execution rate of 4, 4, 4, 3 (worked through just below). For a loop of 5 uops, you get 1.33 cycles per iteration, very close to but not the same as the ideal of 1.25. That corresponds to an execution rate of 4, 4, 4, 4, 3.

These results are hard to explain. The results are repeatable from run to run, and robust to changes such as swapping out the nop for an instruction that actually does something, like mov ecx, 123. It might be something to do with the limit of 1 taken branch every 2 cycles, which applies to all loops except those that are "very small". It might be that the uops occasionally line up such that this limitation kicks in, leading to an extra cycle. Once you get to 12 uops or above, this never occurs, since you are always taking at least three cycles per iteration.
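To see how a 4, 4, 4, 3 issue pattern produces the measured 2.66 cycles/iteration for the 10-uop loop (my arithmetic):

$$3 \text{ iterations} = 30 \text{ uops}; \quad \frac{15 \text{ uops}}{4 \text{ cycles}} \Rightarrow 30 \text{ uops in } 8 \text{ cycles}; \quad \frac{8}{3} \approx 2.67 \text{ cycles/iteration}$$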
Loops from 11 to 32 uops: We see a stair-step pattern, but with a period of two. Basically all loops with an even number of uops perform optimally - i.e., taking exactly N/4 cycles. Loops with an odd number of uops waste one "issue slot", and take the same number of cycles as a loop with one more uop (i.e., a 17-uop loop takes the same 4.5 cycles as an 18-uop loop). So here we have behavior better than ceiling(N/4) for many uop counts, and we have the first evidence that Skylake at least can execute loops in a non-integral number of cycles.

The only outliers are N=25 and N=26, which both take about 1.5% longer than expected. It's small but reproducible, and robust to moving the function around in the file. That's too small to be explained by a per-iteration effect, unless it has a giant period, so it's probably something else.

The overall behavior here is exactly consistent (outside of the 25/26 anomaly) with the hardware unrolling the loop by a factor of 2.

Loops from 33 to ~64 uops: We see a stair-step pattern again, but with a period of 4, and worse average performance than the up-to-32-uop case. The behavior is exactly ceiling(N/4) - that is, the same as the legacy decoder case. So for loops of 32 to 64 uops, the LSD provides no apparent benefit over the legacy decoders in terms of front-end throughput, for this particular limitation. Of course, there are many other ways the LSD is better - it avoids many of the potential decoding bottlenecks that occur for more complex or longer instructions, and it saves power, etc.

All of this is quite surprising, because it means that loops delivered from the uop cache generally perform better in the front end than loops delivered from the LSD, despite the LSD usually being positioned as a strictly better source of uops than the DSB (e.g., as part of advice to try to keep loops small enough to fit in the LSD).
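As a hypothetical application of the odd-count rule above (the nop bodies are stand-ins for real work): unrolling doesn't have to reach an even count to help, because the wasted issue slot is amortized over more work.

```nasm
; 10 single-uop work instructions + macro-fused dec/jnz = 11 uops.
; Odd, so per the region above it costs like 12 uops: 3.0 cycles/iter.
odd11:
        mov  ecx, 1000
.top:
        times 10 nop
        dec  ecx
        jnz  .top

; Unrolled x2: 20 work uops + the fused branch = 21 uops. Still odd
; (it costs like 22 uops), but 22/4 = 5.5 cycles now covers two
; original iterations: 2.75 cycles each, an ~8% improvement.
odd11_unrolled:
        mov  ecx, 1000
.top:
        times 20 nop
        sub  ecx, 2          ; sub + jnz also macro-fuse into one uop
        jnz  .top
```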
Here's another way to look at the same data - in terms of the efficiency loss for a given uop count, versus the theoretical maximum throughput of 4 uops per cycle. A 10% efficiency hit means you only have 90% of the throughput that you'd calculate from the simple N/4 formula.

The overall behavior here is consistent with the hardware not doing any unrolling, which makes sense since a loop of more than 32 uops cannot be unrolled at all in a buffer of 64 uops.

The three regions discussed above are colored differently, and at least a couple of competing effects are visible:

- Everything else being equal, the larger the number of uops involved, the lower the efficiency hit. The hit is a fixed cost only once per iteration, so larger loops pay a smaller relative cost.
- There is a large jump in inefficiency when you cross into the 33+ uop region: both the size of the throughput loss increases, and the number of affected uop counts doubles.
- The first region is somewhat chaotic, and 7 uops is the worst overall uop count.

Alignment

The DSB and LSD analysis above is for loop entries aligned to a 32-byte boundary, but the unaligned case doesn't seem to suffer in either case: there isn't a material difference from the aligned case (other than perhaps some small variation for less than 10 uops that I didn't investigate further).

Here are the unaligned results for 32N-2 and 32N+2 (i.e., the loop top 2 bytes before and after the 32B boundary). The ideal N/4 line is also shown for reference.

Haswell

Next let's take a look at the prior microarchitecture: Haswell. The numbers here have been graciously provided by user Iwillnotexist Idonotexist.

LSD + Legacy Decode Pipeline

First, the results from the "dense code" test, which tests the LSD (for small uop counts) and the legacy pipeline (for larger uop counts, since the loop "busts out" of the DSB due to instruction density).

Immediately we see a difference in terms of when each architecture delivers uops from the LSD for a dense loop. Below we compare Skylake and Haswell for short loops of dense code (1 byte per instruction).

As described above, the Skylake loop stops being delivered from the LSD at exactly 19 uops, as expected from the 18-uop per 32-byte region of code limit. Haswell, on the other hand, seems to stop delivering reliably from the LSD for the 16-uop and 17-uop loops as well. I don't have any explanation for this. There is also a difference in the 3-uop case: oddly, both processors only deliver some of their uops out of the LSD in the 3 and 4 uop cases, but the exact amount is the same for 4 uops, and different for 3.

We mostly care about the actual performance though, right? So let's look at the cycles/iteration for the 32-byte-aligned dense code case. This is the same data as shown above for Skylake (the misaligned series has been removed), with Haswell plotted alongside. Immediately you notice that the pattern is similar for Haswell, but not the same. As above, there are two regions here:

Legacy Decode

The loops larger than ~16-18 uops (the uncertainty is described above) are delivered from the legacy decoders. The pattern for Haswell is somewhat different from Skylake. For the range of 19-30 uops they are identical, but after that Haswell breaks the pattern. Skylake took ceil(N/4) cycles for loops delivered from the legacy decoders. Haswell, on the other hand, seems to take something like ceil((N+1)/4) + ceil((N+2)/12) - ceil((N+1)/12). OK, that's messy (shorter form, anyone?) - but basically it means that while Skylake executes loops with a multiple of 4 uops optimally (i.e., at 4 uops/cycle), on Haswell such counts are usually (at least locally) the least optimal: it takes one more cycle to execute such loops than on Skylake. So you are actually best off with loops of 4N-1 uops on Haswell, except that the 25% of such loops that are also of the form 16N-1 (31, 47, 63, etc.) take one additional cycle. It's starting to sound like a leap year calculation - but the pattern is probably best understood visually above.
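To make the formula concrete, here is how it evaluates for two nearby sizes (my arithmetic, simply plugging into the empirical formula above): a 4N-1 count comes out optimal, while a multiple-of-4 count pays one extra cycle relative to Skylake's ceil(N/4).

$$
\begin{aligned}
N=27:\quad & \lceil 28/4 \rceil + \lceil 29/12 \rceil - \lceil 28/12 \rceil = 7 + 3 - 3 = 7 \text{ cycles} \;(= \lceil 27/4 \rceil) \\
N=32:\quad & \lceil 33/4 \rceil + \lceil 34/12 \rceil - \lceil 33/12 \rceil = 9 + 3 - 3 = 9 \text{ cycles} \;(> 32/4 = 8)
\end{aligned}
$$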
I don't think this pattern is intrinsic to uop dispatch on Haswell, so we shouldn't read too much into it. It seems to be explained by the way the instructions fall into 16-byte decode chunks, as this annotated disassembly for N=35 shows:

```
0000000000455a80 <short_nop_aligned35.top>:
16B cycle
  1     1   455a80: ff c8   dec    eax
  1     1   455a82: 90      nop
  1     1   455a83: 90      nop
  1     1   455a84: 90      nop
  1     2   455a85: 90      nop
  1     2   455a86: 90      nop
  1     2   455a87: 90      nop
  1     2   455a88: 90      nop
  1     3   455a89: 90      nop
  1     3   455a8a: 90      nop
  1     3   455a8b: 90      nop
  1     3   455a8c: 90      nop
  1     4   455a8d: 90      nop
  1     4   455a8e: 90      nop
  1     4   455a8f: 90      nop
  2     5   455a90: 90      nop
  2     5   455a91: 90      nop
  2     5   455a92: 90      nop
  2     5   455a93: 90      nop
  2     6   455a94: 90      nop
  2     6   455a95: 90      nop
  2     6   455a96: 90      nop
  2     6   455a97: 90      nop
  2     7   455a98: 90      nop
  2     7   455a99: 90      nop
  2     7   455a9a: 90      nop
  2     7   455a9b: 90      nop
  2     8   455a9c: 90      nop
  2     8   455a9d: 90      nop
  2     8   455a9e: 90      nop
  2     8   455a9f: 90      nop
  3     9   455aa0: 90      nop
  3     9   455aa1: 90      nop
  3     9   455aa2: 90      nop
  3     9   455aa3: 75 db   jne    455a80 <short_nop_aligned35.top>
```

Here I've noted the 16B decode chunk (1-3) each instruction appears in, and the cycle in which it will be decoded. The rule is basically that up to the next 4 instructions are decoded, as long as they fall in the current 16B chunk. Otherwise, they have to wait until the next cycle. For N=35, we see that there is a loss of 1 decode slot in cycle 4 (only 3 instructions are left in the 16B chunk), but that otherwise the loop lines up very well with the 16B boundaries, and even the last cycle (9) can decode 4 instructions.

Here's a truncated look at N=36, which is identical except for the end of the loop:

```
0000000000455b20 <short_nop_aligned36.top>:
16B cycle
  1     1   455b20: ff c8   dec    eax
  1     1   455b22: 90      nop
  ... [29 lines omitted] ...
  2     8   455b3f: 90      nop
  3     9   455b40: 90      nop
  3     9   455b41: 90      nop
  3     9   455b42: 90      nop
  3     9   455b43: 90      nop
  3    10   455b44: 75 da   jne    455b20 <short_nop_aligned36.top>
```

There are now 5 instructions to decode in the 3rd and final 16B chunk, so one additional cycle is needed. Basically, a loop of 35 instructions, for this particular pattern of instructions, happens to line up better with the 16B boundaries and saves one cycle when decoding. This doesn't mean that N=35 is better than N=36 in general! Different instructions will have different numbers of bytes and will line up differently. A similar alignment issue also explains the additional cycle that is required every 16 bytes:

```
16B cycle
  ...
  2     7   45581b: 90      nop
  2     8   45581c: 90      nop
  2     8   45581d: 90      nop
  2     8   45581e: 90      nop
  3     8   45581f: 75 df   jne    455800 <short_nop_aligned31.top>
```

Here the final jne has slipped into the next 16B chunk (if an instruction spans a 16B boundary, it is effectively in the latter chunk), causing an extra cycle loss. This occurs only every 16 bytes.

So the Haswell legacy decoder results are explained perfectly by a legacy decoder that behaves as described, for example, in Agner Fog's microarchitecture doc.
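As a sketch of what this 16B-chunk sensitivity implies for code layout (illustrative only; the instruction mix is mine), keeping each iteration's instructions inside one 16-byte chunk avoids the end-of-chunk stalls annotated above:

```nasm
; Illustrative: an 8-byte loop body aligned so it never straddles
; a 16B predecode boundary on the legacy-decode path.
ALIGN 16                     ; start the loop top on a 16B boundary
aligned_top:
        add  eax, ebx        ; 2 bytes
        add  ecx, edx        ; 2 bytes
        dec  esi             ; 2 bytes
        jnz  aligned_top     ; 2 bytes - all 8 bytes fit in one 16B chunk
```

This is the kind of massaging the "upshot" advice below refers to.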
In fact, this decode model also seems to explain the Skylake results, if you assume Skylake can decode 5 instructions per cycle (delivering up to 5 uops). Assuming it can, the asymptotic legacy decode throughput on this code for Skylake is still 4 uops/cycle, since a block of 16 nops decodes 5-5-5-1, versus 4-4-4-4 on Haswell, so you only get benefits at the edges: in the N=36 case above, for example, Skylake can decode all 5 remaining instructions, versus 4-1 for Haswell, saving a cycle.

The upshot is that it seems the legacy decoder behavior can be understood in a fairly straightforward manner, and the main optimization advice is to continue to massage code so that it falls "smartly" into the 16B-aligned chunks (perhaps that's NP-hard, like bin packing?).

DSB (and LSD again)

Next let's take a look at the scenario where the code is served out of the LSD or DSB - by using the "long nop" test, which avoids breaking the 18-uop per 32B chunk limit, and so stays in the DSB.

Haswell vs Skylake: note the LSD behavior - here, Haswell stops serving out of the LSD at exactly 57 uops, which is completely consistent with the published LSD size of 57 uops. There is no weird "transition period" like we see on Skylake. Haswell also has the weird behavior for 3 and 4 uops, where only ~0% and ~40% of the uops, respectively, come from the LSD.

Performance-wise, Haswell is normally in line with Skylake, with a few deviations, e.g., around 65, 77 and 97 uops, where it rounds up to the next cycle, whereas Skylake is always able to sustain 4 uops/cycle even when that results in a non-integer number of cycles. The slight deviation from expected at 25 and 26 uops has disappeared. Perhaps the 6-uop delivery rate of Skylake helps it avoid uop-cache alignment issues that Haswell suffers with its 4-uop delivery rate.

Other Architectures

Results for the following additional architectures were kindly provided by user Andreas Abel, but we'll have to use another answer for further analysis, as we are at the character limit here.

Help Needed

Although results for many platforms have been kindly offered by the community, I'm still interested in results on chips older than Nehalem and newer than Coffee Lake (in particular, Cannon Lake, which is a new uarch). The code to generate these results is public. Also, the results above are available in .ods format on GitHub as well.

In particular, the legacy decoder maximum throughput apparently increased from 4 to 5 uops in Skylake, and the maximum throughput for the uop cache increased from 4 to 6. Both of those could impact the results described here.

Intel actually likes to call the legacy decoder the MITE (Micro-instruction Translation Engine), perhaps because it's a faux pas to actually tag any part of your architecture with the legacy connotation.

Technically there is another, even slower, source of uops - the MS (microcode sequencing engine), which is used to implement any instruction with more than 4 uops, but we ignore it here since none of our loops contain microcoded instructions.

This works because any aligned 32-byte chunk can use at most 3 ways in its uop cache set, and each way holds up to 6 uops. So if you use more than 3 * 6 = 18 uops in a 32B chunk, the code can't be stored in the uop cache at all. It's probably rare to encounter this condition in practice, since the code needs to be very dense (less than 2 bytes per instruction) to trigger it.
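A quick check of why the dense test trips this limit (my arithmetic, using the figures from this note):

$$3 \text{ ways} \times 6 \;\tfrac{\text{uops}}{\text{way}} = 18 \;\tfrac{\text{uops}}{\text{32B chunk}}; \qquad 19 \text{ one-byte instructions} > 18 \;\Rightarrow\; \text{legacy decode}$$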
The nop instructions decode to one uop, but aren't eliminated prior to execution (i.e., they don't use an execution port) - they still take up space in the front end, and so count against the various limits that we are interested in.

The LSD is the loop stream detector, which caches small loops of up to 64 (Skylake) uops directly in the IDQ. On earlier architectures it can hold 28 uops (with both logical cores active) or 56 uops (with one logical core active).

We can't easily fit a 2-uop loop in this pattern, since that would mean zero nop instructions, meaning the dec and jnz instructions would macro-fuse, with a corresponding change in the uop count. Just take my word that all loops with 4 or fewer uops execute at best at 1 cycle/iteration.

For fun, I just ran perf stat against a short run of Firefox, where I opened a tab and clicked around on a few Stack Overflow questions. For instructions delivered, I got 46% from the DSB, 50% from the legacy decoder and 4% from the LSD. This shows that at least for big, branchy code like a browser, the DSB still can't capture the large majority of the code (luckily the legacy decoders aren't too bad).

By this, I mean that all the other cycle counts can be explained simply by taking an "effective" integral loop cost in uops (which might be higher than the actual size in uops) and dividing by 4. For these very short loops, this doesn't work - you can't get to 1.333 cycles per iteration by dividing any integer by 4. Said another way, in all other regions the costs have the form N/4 for some integer N.

In fact, we know that Skylake can deliver 5 uops per cycle from the legacy decoder, but we don't know if those 5 uops can come from 5 different instructions, or only 4 or fewer. That is, we expect that Skylake can decode in the pattern 2-1-1-1, but I'm not sure if it can decode in the pattern 1-1-1-1-1. The above results give some evidence that it can indeed decode 1-1-1-1-1.
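A hypothetical illustration of the two patterns (the instruction choice is mine; what matters is the uops-per-instruction shape of each group):

```nasm
; 2-1-1-1: the complex decoder takes a 2-uop instruction while the
; simple decoders each take a 1-uop instruction - 4 instructions,
; 5 uops in one decode cycle.
    add  [rdi], eax      ; 2 fused-domain uops (load+add, store)
    add  ecx, 1          ; 1 uop
    add  edx, 1          ; 1 uop
    add  esi, 1          ; 1 uop

; 1-1-1-1-1: five single-uop instructions in one cycle - the pattern
; the results above suggest Skylake's legacy decoder can sustain.
    add  eax, 1
    add  ecx, 1
    add  edx, 1
    add  esi, 1
    add  edi, 1
```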