Can somebody explain the Latency and the Throughput values given in the Intel Intrinsics Guide? Have I understood it correctly that the latency is the amount of time units an instruction takes to run, and the throughput is the number of instructions that can be started per time unit? If my definition is correct, why is the latency for some instructions higher on newer CPU versions (e.g. mulps)?

Solution

Missing from that table: MULPS latency on Broadwell: 3. On Skylake: 4.

The intrinsic finder's latency is accurate in this case, although it occasionally doesn't match Agner Fog's experimental testing. (That VEXTRACTF128 latency may be a case of Intel not including a bypass delay in their table.) See my answer on that linked question for more details about what to do with throughput and latency numbers, and what they mean for a modern out-of-order CPU.

MULPS latency did increase from 4 (Nehalem) to 5 (Sandybridge). This may have been to save power or transistors, but more likely because Sandybridge standardized uop latencies to only a few different values, to avoid writeback conflicts: i.e. cases where the same execution unit would produce two results in the same cycle, e.g. from starting a 2c uop one cycle, then a 1c uop the next cycle.

This simplifies the uop scheduler, which dispatches uops from the Reservation Station to the execution units. It works more or less in oldest-first order, but it has to filter by which ones have their inputs ready. The scheduler is power-hungry, and this is a significant part of the power cost of out-of-order execution.
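The latency-vs-throughput distinction in the question can be made concrete with a small intrinsics sketch (not from the original answer; function names and iteration counts are illustrative): a chain where each mulps consumes the previous result runs at one multiply per MULPS *latency*, while independent chains let the out-of-order scheduler overlap multiplies and run at MULPS *throughput*.

```c
#include <immintrin.h>

/* Dependent chain: each result feeds the next multiply, so the loop is
   bound by MULPS latency (5c on SnB, 3c on BDW, 4c on SKL). */
float mulps_dependent(float start, int iters) {
    __m128 x = _mm_set1_ps(start);
    __m128 one = _mm_set1_ps(1.0f);   /* multiply by 1.0 so the value stays finite */
    for (int i = 0; i < iters; i++)
        x = _mm_mul_ps(x, one);       /* serial dependency */
    float out[4];
    _mm_storeu_ps(out, x);
    return out[0];
}

/* Four independent chains: the scheduler can keep multiple multiplies in
   flight, so this loop is bound by MULPS throughput, not latency. */
float mulps_independent(float start, int iters) {
    __m128 one = _mm_set1_ps(1.0f);
    __m128 a = _mm_set1_ps(start), b = a, c = a, d = a;
    for (int i = 0; i < iters; i++) {
        a = _mm_mul_ps(a, one);
        b = _mm_mul_ps(b, one);
        c = _mm_mul_ps(c, one);
        d = _mm_mul_ps(d, one);
    }
    float out[4];
    /* Combine the four chains; with a multiplier of 1.0 this is start^4. */
    _mm_storeu_ps(out, _mm_mul_ps(_mm_mul_ps(a, b), _mm_mul_ps(c, d)));
    return out[0];
}
```

Timing each loop (e.g. with `clock_gettime`) and dividing cycles by iteration count would recover roughly the latency for the first function and roughly the reciprocal throughput per multiply for the second; the functions here only demonstrate the dependency structure.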
(It's unfortunately not practical to make a scheduler that picks uops in critical-path-first order, to avoid having independent uops steal cycles from the critical path with resource conflicts.) Agner Fog explains the same thing in the SnB section of his microarch pdf.

Hmm, I just realized that Agner's numbers for VEXTRACTF128 xmm, ymm, imm8 are weird. Agner lists it as 1 uop with 2c latency on SnB, but Intel lists it as 1c latency (as discussed here). Maybe the execution unit is 1c latency, but there's a built-in 1c bypass delay (for lane-crossing?) before you can use the result. That would explain the discrepancy between Intel's numbers and Agner's experimental test.

Some instructions are still 2c latency, because they decode to 2 dependent uops that are each 1c latency. MULPS is a single uop, even the AVX 256b version, because even Intel's first-gen AVX CPUs have full-width 256b execution units (except the divide/sqrt unit). Needing twice as many copies of the FP multiplier circuitry is a good reason for optimizing it to save transistors at the cost of latency.

This pattern holds up to and including Broadwell, AFAICT from searching Agner's tables. (Using LibreOffice, I selected the whole table, did data->filter->standard filter, and looked for rows with column C = 1 and column F = 4, then repeated for 2, checking for any uops that aren't loads or stores.)

Haswell sticks to the pattern of only 1, 3 and 5 cycle ALU uop latencies (except for AESENC/AESDEC, which is 1 uop for port5 with 7c latency; and of course DIVPS and SQRTPS). There's also CVTPI2PS xmm, mm at 1 uop 4c latency, but maybe that's 3c for the p1 uop plus 1c of bypass delay, either because of the way Agner Fog measured it or because it's unavoidable. VMOVMSKPS r32, ymm is also 2c (vs. 3c for the r32, xmm version).

Broadwell dropped MULPS latency to 3, the same as ADDPS, but kept FMA at 5c. Presumably they figured out how to shortcut the FMA unit to produce just a multiply when no add was needed.

Skylake is able to handle uops with latency=4.
Latency for FMA, ADDPS/D, and MULPS/D = 4 cycles. (SKL drops the dedicated vector-FP add unit, and does everything with the FMA unit. So ADDPS/D throughput is doubled to match MULPS/D and FMA...PS/D. I'm not sure which change motivated what, and whether they would have introduced 4c-latency instructions at all if they hadn't wanted to drop the vec-FP adder without hurting ADDPS latency too badly.)

Other SKL instructions with 4c latency: PHMINPOSUW (down from 5c), AESDEC/AESENC, CVTDQ2PS (up from 3c, but this might be 3c + bypass), RCPPS (down from 5c), RSQRTPS, CMPPS/D (up from 3c). Hmm, I guess FP compares were done in the adder, and now have to use FMA.

MOVD r32, xmm and MOVD xmm, r32 are listed as 2c, perhaps a bypass delay from int-vec to int? Or a glitch in Agner's testing? Testing the latency would require other instructions to create a round-trip back to xmm. It's 1c on HSW.

Agner lists SKL MOVQ r64, xmm as 2 cycles (port0), but MOVQ xmm, r64 as 1c (port5), and it seems extremely weird that reading a 64-bit register is faster than reading a 32-bit register. Agner has had mistakes in his table in the past; this may be another.
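One practical consequence of SKL's 4c ADDPS (running on the FMA units) is for reductions: a sum with a single accumulator is a serial dependency chain, bound by add latency, while splitting it across independent accumulators lets the out-of-order core overlap the adds and approach add throughput instead. A hedged sketch of that technique (function names are illustrative, not from the original answer):

```c
#include <immintrin.h>

/* Single accumulator: each ADDPS depends on the previous one, so the loop
   runs at one add per ADDPS latency (4 cycles on Skylake).
   n is assumed to be a multiple of 4. */
float sum_one_acc(const float *a, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  /* serial dep chain */
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}

/* Four accumulators: four independent chains the scheduler can interleave,
   so the loop is bound by ADDPS throughput instead of latency.
   n is assumed to be a multiple of 16. */
float sum_four_acc(const float *a, int n) {
    __m128 s0 = _mm_setzero_ps(), s1 = s0, s2 = s0, s3 = s0;
    for (int i = 0; i < n; i += 16) {
        s0 = _mm_add_ps(s0, _mm_loadu_ps(a + i));
        s1 = _mm_add_ps(s1, _mm_loadu_ps(a + i + 4));
        s2 = _mm_add_ps(s2, _mm_loadu_ps(a + i + 8));
        s3 = _mm_add_ps(s3, _mm_loadu_ps(a + i + 12));
    }
    __m128 acc = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}
```

Note that the two versions can give slightly different floating-point results for general inputs, since they associate the additions differently; that reordering is exactly what buys the extra parallelism.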