Does SSE/AVX provide a way to determine whether a result was rounded?

One of the purposes of the C1 bit in the x87 FPU status word is to show whether or not an inexact result was rounded up. Does SSE/AVX provide any such indication for scalar operations? I did not see a similar bit in the MXCSR register. Am I forced to use x87 instructions if I want this information?

Solution

SSE/AVX do not provide hardware support for detecting this, even for scalar instructions like addss. SSE was designed for SIMD, with 4 floats per XMM vector, and presumably Intel didn't want to provide a bitmap of 4 bits in the MXCSR, although that would have been a possible design choice.

As @Mysticial points out in comments, it can be possible to calculate it using extra instructions. (This is an untested idea that might do what you want. I think it should work even with subnormals and so on; comparing for exact equality is the same as a bitwise compare, except for -0.0 == +0.0 and NaN.)

With AVX512, you can do your add/sub/mul/div/sqrt calculation normally (with default rounding), then again with a rounding-mode override to truncation toward 0. Use vcmpps for equality on the results: the elements that compare exactly equal were rounded toward 0 by the default rounding mode (or were exact both times). Of course you could use toward -Inf or toward +Inf as your override instead of toward 0, to detect rounding in that direction instead.

AVX512's EVEX prefix can encode a rounding-mode override on a per-instruction basis, without changing MXCSR. This makes the trick practical, and significantly more efficient than changing MXCSR, e.g. _mm512_add_round_ps(__m512 a, __m512 b, int). Note that AVX512 embedded rounding (er) is only available for 512-bit vectors; unfortunately you can't use it with AVX512VL to do rounding overrides on 256-bit vectors, so you can't avoid the max-turbo reduction and other downsides of using 512-bit vectors on current Skylake-family CPUs. Using ER also applies SAE (suppress all exceptions), meaning the instruction doesn't have to update MXCSR at all. See AVX-512 Instruction Encoding - {er} Meaning.
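To make that concrete, here is a minimal, untested sketch using the AVX-512F intrinsics mentioned above (the function name and the choice of ps elements are just for illustration); it returns a mask with the same convention as the asm example that follows, 1 = rounded away from 0 by the default mode:

```c
#include <immintrin.h>

// Untested sketch: which lanes of a+b did the default rounding mode round
// away from zero?  Requires AVX-512F (compile with e.g. -mavx512f).
static inline __mmask16 add_rounded_away_from_zero(__m512 a, __m512 b)
{
    __m512 def = _mm512_add_ps(a, b);   // default rounding (normally nearest-even)
    // Same add, but with the per-instruction override: round toward zero.
    // ER implies SAE, so MXCSR exception flags aren't updated either.
    __m512 rz  = _mm512_add_round_ps(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    // Not-equal lanes were rounded away from 0; equal lanes were rounded
    // toward 0 or were exact both times.
    return _mm512_cmpneq_ps_mask(def, rz);
}
```

If you want something closer to the x87 C1 "was the inexact result rounded up?" semantics, use _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC as the override instead: the not-equal lanes should then be exactly the ones the default mode rounded up.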
In asm syntax, rz = round toward zero. See Table 2-36 (EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instructions) in Intel's vol.2 x86 manual.

```asm
vaddpd    zmm2, zmm1, zmm0              ; no override, or {rne-sae} would be Nearest-Even
vaddpd    zmm3, zmm1, zmm0, {rz-sae}    ; rounding = truncation toward Zero
vcmpneqpd k1, zmm2, zmm3                ; compare for not-equal
;;; k1 = bitmask
;;  0 means rounded toward 0, or exact
;;  1 means rounded away from 0
```

If you don't need the primary result to be a 512-bit vector, you can do that and the compare with XMM or YMM registers, but the {rz-sae} operation has to be ZMM. A YMM compare gives you the option of comparing into another YMM register (AVX1) instead of into an AVX512 mask register; but if you're using AVX512, mask registers are usually pretty nice.

This always costs 2 extra instructions: repeating the operation and a compare. Mysticial's suggestion to use an FMA after mulps might avoid that, if you just use the sign bit directly instead of comparing against zero: e.g. vmovmskps to get an integer bitmap, or vxorps or vandps to combine some vectors where the "truth value" you care about is the sign bit. That could be an input for vblendvps (which also only looks at sign bits), or for an eventual vmovmskps. (A sketch of this idea follows at the end of this answer.)

Changing the rounding mode without AVX512 might not be a total disaster, especially if you can do a few vectors with the default mode before changing to truncation and redoing them. If you have enough registers to play with to amortize the MXCSR changes over enough operations, that might make it more efficient than a rounding-direction-detection sequence that takes 3 or more instructions per vector.

Apparently some Intel CPUs do rename MXCSR; a perf event for MXCSR-rename stall cycles exists on some microarchitectures (not sure which): "Stalls due to the MXCSR register rename occurring too close to a previous MXCSR rename." So changing it wouldn't have to drain the scheduler, but it's not great, and according to that wording, changing it twice in close succession could be bad. IDK if there's just a limited number of physical MXCSR entries to rename onto, or some other reason for that limitation.

Of course in a loop you wouldn't store, bit-flip, and reload MXCSR values; you'd keep two MXCSR values in memory and just ldmxcsr them (see the last sketch below).
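To illustrate Mysticial's FMA idea for the multiply case, here is an untested sketch (AVX + FMA3, no AVX-512 needed; the function name is made up). The FMA recovers the exact rounding error of the product, assuming no overflow or underflow, and the error's sign bit says whether the product was rounded up:

```c
#include <immintrin.h>

// Untested sketch: sign bit set in the returned vector  =>  that lane of a*b
// was rounded up by the default mode.  Needs FMA3 (compile with -mavx -mfma).
static inline __m256 mul_rounded_up_signbits(__m256 a, __m256 b)
{
    __m256 prod = _mm256_mul_ps(a, b);          // rounded product
    // err = a*b - prod, computed exactly by the FMA: the rounding error of a
    // correctly-rounded product is representable, barring overflow/underflow.
    __m256 err  = _mm256_fmsub_ps(a, b, prod);
    // err < 0  =>  prod > exact value  =>  rounded up (sign bit set).
    // err > 0  =>  rounded down;  err == +0.0  =>  exact.
    return err;
}

// Usage: feed the sign bits straight into vblendvps / vandps / vxorps, or
//   int up_mask = _mm256_movemask_ps(mul_rounded_up_signbits(a, b));
```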
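And finally a sketch of the non-AVX512 fallback with two MXCSR values, switched with ldmxcsr (_mm_setcsr). Untested; the function and parameter names are made up, a real version would batch more vectors per mode switch to amortize the cost as described above, and the empty asm barriers are a GNU-specific workaround for the fact that the compiler doesn't model MXCSR and might otherwise CSE or reorder the two identical adds:

```c
#include <immintrin.h>
#include <stddef.h>

// Untested sketch: sum[i] = a[i] + b[i] with the default rounding mode, and
// one byte of "rounded away from zero" flags per 8-float vector.
// Assumes n is a multiple of 8 for brevity.  Compile with -mavx.
void add_detect_round_away(const float *a, const float *b, float *sum,
                           unsigned char *away_mask, size_t n)
{
    const unsigned csr_default = _mm_getcsr();
    const unsigned csr_rz = (csr_default & ~_MM_ROUND_MASK) | _MM_ROUND_TOWARD_ZERO;

    for (size_t i = 0; i < n; i += 8) {
        __m256 va  = _mm256_loadu_ps(a + i);
        __m256 vb  = _mm256_loadu_ps(b + i);
        __m256 def = _mm256_add_ps(va, vb);          // default rounding mode

        // Barriers: keep the first add before the CSR switch, and force the
        // second add to use "new" inputs so it can't be CSEd with the first.
        __asm__ volatile("" : "+x"(def), "+x"(va), "+x"(vb));
        _mm_setcsr(csr_rz);                          // switch to round-toward-zero
        __asm__ volatile("" : "+x"(va), "+x"(vb));
        __m256 rz = _mm256_add_ps(va, vb);           // same add, truncated toward 0
        __asm__ volatile("" : "+x"(rz));
        _mm_setcsr(csr_default);                     // restore the default mode

        _mm256_storeu_ps(sum + i, def);
        // Lanes that differ were rounded away from 0 by the default mode;
        // equal lanes were rounded toward 0 or were exact.
        __m256 neq = _mm256_cmp_ps(def, rz, _CMP_NEQ_UQ);
        away_mask[i / 8] = (unsigned char)_mm256_movemask_ps(neq);
    }
}
```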