Why is Intel's advertised Haswell AVX latency 3x slower than Sandy Bridge?

Question

In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following:

Performance
Architecture    Latency    Throughput
Haswell         3          -
Ivy Bridge      1          -
Sandy Bridge    1          -

I found this difference puzzling. Is this difference because there are new instructions that replace these ones, or something that compensates for it (which ones)? Does anyone know if Skylake changes this model further?

Solution

TL:DR: all lane-crossing shuffles / inserts / extracts have 3c latency on Haswell/Skylake, but 2c latency on SnB/IvB, according to Agner Fog's testing. This is probably 1c in the execution unit + an unavoidable bypass delay of some sort, because the actual execution units in SnB through Broadwell have standardized latencies of 1, 3, or 5 cycles, never 2 or 4 cycles. (SKL makes some uops 4c, including FMA/ADDPS/MULPS.)

(Note that on AMD CPUs that do AVX1 with 128b ALUs (e.g. Bulldozer/Piledriver/Steamroller), insert128/extract128 are much faster than lane-crossing shuffles like VPERM2F128.)
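To make the question's cost table concrete, here is the kind of code it applies to: building a __m256i from two 128-bit halves. This is only a minimal sketch (the helper name is mine, not from the answer); the vinsertf128 that places the high half is the lane-crossing operation whose latency changed.

```c
#include <immintrin.h>

/* Hypothetical helper: combine two 128-bit halves into one 256-bit vector.
 * The vinsertf128 placing the high half is the lane-crossing op the table
 * describes: 2c latency on SnB/IvB, 3c on Haswell/Skylake, still one per
 * clock throughput.  The cast of the low half costs no instruction. */
static inline __m256i combine128(__m128i lo, __m128i hi) {
    __m256i v = _mm256_castsi128_si256(lo);      /* reinterpret low half */
    return _mm256_insertf128_si256(v, hi, 1);    /* the 2c/3c lane-crossing insert */
}
```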
The intrinsics guide has bogus data sometimes. I assume it's meant to be for the reg-reg form of instructions, except in the case of the load intrinsics. Even when it's correct, the intrinsics guide doesn't give a very detailed picture of performance; see below for discussion of Agner Fog's tables/guides.

(One of my pet peeves with intrinsics is that it's hard to use PMOVZX / PMOVSX as a load, because the only intrinsics provided take a __m128i source, even though pmovzxbd only loads 4B or 8B (ymm). It and/or broadcast-loads (_mm_set1_* with AVX1/2) are a great way to compress constants in memory. There should be intrinsics that take a const char* (because that's allowed to alias anything).)

In this case, Agner Fog's measurements show that SnB/IvB have 2c latency for reg-reg vinsertf128 / vextractf128, while his measurements for Haswell (3c latency, one per 1c throughput) agree with Intel's table. So it's another case where the numbers in Intel's intrinsics guide are wrong. It's great for finding the right intrinsic, but not a good source for reliable performance numbers. It doesn't tell you anything about execution ports or total uops, and often omits even the throughput numbers.

Latency is often not the limiting factor in vector integer code anyway. This is probably why Intel let the latencies increase for Haswell.

The reg-mem form is significantly different. vinsertf128 y,y,m,i has latency / reciprocal throughput of: IvB: 4/1, Haswell/BDW: 4/2, SKL: 5/0.5. It's always a 2-uop instruction (fused domain), using one ALU uop. IDK why the throughput is so different. Maybe Agner tested slightly differently?

Interestingly, vextractf128 mem, reg, i doesn't use any ALU uops. It's a 2-fused-domain-uop instruction that only uses the store-data and store-address ports, not the shuffle unit. (Agner Fog's table lists it as using one p015 uop on SnB, and 0 on IvB. But even on SnB, it doesn't have a mark in any specific column, so IDK which one is right.)

It's silly that vextractf128 wastes a byte on an immediate operand. I guess they didn't know they were going to use EVEX for the next vector length extension, and were preparing for the immediate to go from 0..3. But for AVX1/2, you should never use that instruction with the immediate = 0. Instead, just movups mem, xmm or movaps xmm, xmm. (I think compilers know this, and do that when you use the intrinsic with index = 0, like they do for _mm_extract_epi32 and so on (movd).)

Latency is more often a factor in FP code, and Skylake is a monster for FP ALUs. They managed to drop the latency for FMA to 4 cycles, so mulps/addps/fma...ps are all 4c latency with one per 0.5c throughput. (Broadwell was mulps/addps = 3c latency, fma = 5c latency. Haswell was addps = 3c latency, mul/fma = 5c.) Skylake dropped the separate add unit, so addps actually worsened from 3c to 4c, but with twice the throughput. (Haswell/BDW only did addps with one per 1c throughput, half that of mul/fma.) So using many vector accumulators is essential in most FP algorithms for keeping 8 or 10 FMAs in flight at once to saturate the throughput, if there's a loop-carried dependency. Otherwise, if the loop body is small enough, out-of-order execution will have multiple iterations in flight at once.
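As a concrete illustration of "many vector accumulators", here is a minimal dot-product sketch (the function name, unroll factor, and size assumption are mine, not from the answer; it assumes a compiler targeting AVX2+FMA). Each accumulator is its own loop-carried dependency chain, so the 4-cycle FMA latency (5c on Haswell) doesn't serialize the loop the way a single accumulator would; the answer suggests enough accumulators to keep 8-10 FMAs in flight, i.e. more unrolling than shown here.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical sketch: partial dot-product sums with 4 independent
 * accumulators.  n is assumed to be a multiple of 32 floats for brevity. */
__m256 dot_partial_sums(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        /* four independent FMA chains: each acc depends only on its own past value */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    /* Combine outside the loop; reduce to a scalar with the horizontal-sum
     * pattern sketched at the end of this answer. */
    return _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
}
```

With a single accumulator the loop could run at best one FMA per 4-5 cycles; with several independent chains it can approach the one-per-0.5c FMA throughput the answer quotes.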
Integer in-lane ops are typically only 1c latency, so you need a much smaller amount of parallelism to max out the throughput (and not be limited by latency).

None of the other options for getting data into / out of the high half of a ymm are any better

vperm2f128 or AVX2 vpermps are more expensive. Going through memory will cause a store-forwarding failure -> big latency for the insert (2 narrow stores -> wide load), so it's obviously bad. Don't try to avoid vinsertf128 in cases where it's useful.

As always, try to use the cheapest instruction sequences possible. e.g. for a horizontal sum or other reduction, always reduce down to a 128b vector first, because cross-lane shuffles are slow. Usually it's just vextractf128 / addps xmm, then the usual horizontal 128b (see the sketch at the end of this answer).

As Mysticial alluded to, Haswell and later have half the in-lane vector shuffle throughput of SnB/IvB for 128b vectors. SnB/IvB can do pshufb / pshufd with one per 0.5c throughput, but only one per 1c for shufps (even the 128b version); the same goes for other shuffles that have a ymm version in AVX1 (e.g. vpermilps, which apparently exists only so FP load-and-shuffle can be done in one instruction). Haswell got rid of the 128b shuffle unit on port1 altogether, instead of widening it for AVX2.

re: Skylake

Agner Fog's guides / instruction tables were updated in December to include Skylake. See also the x86 tag wiki for more links. The reg,reg form has the same performance as on Haswell/Broadwell.
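Here is a minimal sketch of the reduce-to-128b-first horizontal sum described above (the function name is mine; only the first step is lane-crossing, everything else stays in cheap 128b shuffles and needs nothing beyond AVX1):

```c
#include <immintrin.h>

/* Hypothetical helper: horizontal sum of a __m256.  One vextractf128 / addps
 * to get down to 128 bits, then ordinary in-lane shuffles. */
float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);      /* low 128 bits: no instruction */
    __m128 hi = _mm256_extractf128_ps(v, 1);    /* the only lane-crossing op */
    __m128 s  = _mm_add_ps(lo, hi);             /* vextractf128 / addps xmm */
    __m128 sh = _mm_movehl_ps(s, s);            /* bring high 64 bits down */
    s  = _mm_add_ps(s, sh);
    sh = _mm_shuffle_ps(s, s, 0x1);             /* element 1 down to element 0 */
    s  = _mm_add_ss(s, sh);
    return _mm_cvtss_f32(s);
}
```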