Problem description
Which instructions would be used for comparing two 128 bit vectors consisting of 4 * 32-bit floating point values?
Is there an instruction that considers a NaN value on both sides as equal? If not, how big would the performance impact of a workaround that provides reflexivity (i.e. NaN equals NaN) be?
I heard that ensuring reflexivity would have a significant performance impact compared with IEEE semantics, where NaN doesn't equal itself, and I'm wondering how big that impact would be.
I know that you typically want to use epsilon comparisons instead of exact equality when dealing with floating-point values. But this question is about exact equality comparisons, which you could for example use to eliminate duplicate values from a hash set.
Requirements
- +0 and -0 must compare as equal.
- NaN must compare equal with itself.
- Different representations of NaN should be equal, but that requirement might be sacrificed if the performance impact is too big.
- The result should be a boolean, true if all four float elements are the same in both vectors and false if at least one element differs, where true is represented by a scalar integer 1 and false by 0.
Test cases
(NaN, 0, 0, 0) == (NaN, 0, 0, 0) // for all representations of NaN
(-0, 0, 0, 0) == (+0, 0, 0, 0) // equal despite different bitwise representations
(1, 0, 0, 0) == (1, 0, 0, 0)
(0, 0, 0, 0) != (1, 0, 0, 0) // at least one different element => not equal
(1, 0, 0, 0) != (0, 0, 0, 0)
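As a baseline for the test cases above, here is a minimal scalar sketch in C of what "reflexive equality" means per element. The function name reflexive_equal_ref is made up for illustration; it is only a reference to check vectorized versions against, not something from .NET or any library.

```c
#include <math.h>     /* isnan, NAN */
#include <stdbool.h>

/* Hypothetical scalar reference: returns 1 if all four elements are
   "reflexively equal", 0 if at least one element differs. */
int reflexive_equal_ref(const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++) {
        /* IEEE equality: true for +0 == -0, false whenever a NaN is involved. */
        bool ieee_equal = (a[i] == b[i]);
        /* Any NaN encoding is treated as equal to any other NaN encoding. */
        bool both_nan = isnan(a[i]) && isnan(b[i]);
        if (!ieee_equal && !both_nan)
            return 0;
    }
    return 1;
}
```

This version implements the strictest form of the requirements (all NaN encodings equal), so it can serve as the oracle for both the bitwise-NaN and the any-NaN variants.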
My idea for implementing this
I think it might be possible to combine two NotLessThan comparisons (CMPNLTPS?) using and to achieve the desired result. That would be the assembler equivalent of AllTrue(!(x < y) and !(y < x)) or AllFalse((x < y) or (y < x)).
Background
The background for this question is Microsoft's plan to add a Vector type to .NET. I'm arguing for a reflexive .Equals method there, and need a clearer picture of how big the performance impact of this reflexive equals over an IEEE equals would be. See "Should Vector<float>.Equals be reflexive or should it follow IEEE 754 semantics?" on programmers.se for the long story.
Recommended answer
Even when AVX VCMPPS is available (with its greatly enhanced choice of predicates), a reflexive compare is less efficient than an IEEE comparison. You have to do at least two compares and combine the results. It's not too bad, though.
- Different NaN encodings aren't equal: effectively 2 extra insns (adding 2 uops). Without AVX: one extra movaps beyond that.
- Different NaN encodings are equal: effectively 4 extra insns (adding 4 uops). Without AVX: two extra movaps insns.
An IEEE compare-and-branch is 3 uops: cmpeqps / movmskps / test-and-branch. Intel and AMD both macro-fuse the test-and-branch into a single uop/m-op.
With AVX512: bitwise-NaN is probably just one extra instruction, since normal vector compare and branch probably uses vcmpEQ_OQps / ktest same,same / jcc, so combining two different mask regs is free (just change the args to ktest). The only cost is the extra vpcmpeqd k2, xmm0,xmm1.
AVX512 any-NaN is just two extra instructions (2x VFPCLASSPS, with the 2nd one using the result of the first as a zeromask. See below). Again, then ktest with two different args to set flags.
If we give up on considering different NaN encodings equal to each other:
- Bitwise equal catches two identical NaNs.
- IEEE equal catches the +0 == -0 case.
There are no cases where either compare gives a false positive, since ieee_equal is false when either operand is NaN. (We want just equal, not equal-or-unordered. AVX vcmpps provides both options, while SSE only provides a plain equal operation.)
We want to know when all elements are equal, so we should start with inverted comparisons. It's easier to check for at least one non-zero element than to check for all elements being non-zero. (i.e. horizontal AND is hard, horizontal OR is easy (pmovmskb / test, or ptest). Taking the opposite sense of a comparison is free (jnz instead of jz).) This is the same trick that Paul R used.
; inputs in xmm0, xmm1
movaps xmm2, xmm0 ; unneeded with 3-operand AVX instructions
cmpneqps xmm2, xmm1 ; 0:A and B are ordered and equal. -1:not ieee_equal. predicate=NEQ_UQ in VEX encoding expanded notation
pcmpeqd xmm0, xmm1 ; -1:bitwise equal 0:otherwise
; xmm0 xmm2
; 0 0 -> equal (ieee_equal only)
; 0 -1 -> unequal (neither)
; -1 0 -> equal (bitwise equal and ieee_equal)
; -1 -1 -> equal (bitwise equal only: only happens when both are NaN)
andnps xmm0, xmm2 ; NOT(xmm0) AND xmm2
; xmm0 elements are -1 where (not bitwise equal) AND (not IEEE equal).
; xmm0 all-zero iff every element was bitwise or IEEE equal, or both
movmskps eax, xmm0
test eax, eax ; it's too bad movmsk doesn't set EFLAGS according to the result
jz no_differences
For double-precision, ...PD and pcmpeqQ will work the same.
If the not-equal code goes on to find out which element isn't equal, a bit-scan on the movmskps result will give you the position of the first difference.
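For readers who prefer intrinsics, the bitwise-NaN sequence above maps onto SSE2 intrinsics roughly as follows. This is a sketch, not a tuned implementation: the function names are hypothetical, and it assumes an x86 compiler that provides <emmintrin.h>. The mask-scanning helper uses a plain loop where real code would more likely use bsf/tzcnt.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns a 4-bit mask with bit i set where element i
   is neither IEEE-equal nor bitwise-equal. 0 means "no differences". */
int reflexive_diff_mask_bitwise_nan(__m128 a, __m128 b)
{
    /* cmpneqps: all-ones where unequal or unordered (any NaN involved). */
    __m128 not_ieee_equal = _mm_cmpneq_ps(a, b);
    /* pcmpeqd: all-ones where the 32-bit patterns are identical. */
    __m128 bitwise_equal = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b)));
    /* andnps: NOT(bitwise_equal) AND not_ieee_equal. */
    __m128 differs = _mm_andnot_ps(bitwise_equal, not_ieee_equal);
    /* movmskps: collect the sign bit of each element. */
    return _mm_movemask_ps(differs);
}

/* 1 if every element is IEEE-equal or bitwise-equal, else 0. */
int reflexive_equal_bitwise_nan(__m128 a, __m128 b)
{
    return reflexive_diff_mask_bitwise_nan(a, b) == 0;
}

/* Index of the first differing element, or -1 if the mask is empty. */
int first_difference(int mask)
{
    for (int i = 0; i < 4; i++)
        if (mask & (1 << i))
            return i;
    return -1;
}
```

Returning the mask rather than only a boolean keeps the position information around for callers that need it, at no extra cost.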
With SSE4.1 PTEST you can replace andnps / movmskps / test-and-branch with:
ptest xmm0, xmm2 ; CF = 0 == (NOT(xmm0) AND xmm2).
jc no_differences
I expect this is the first time most people have ever seen the CF result of PTEST be useful for anything. :)
It's still three uops on Intel and AMD CPUs ((2 ptest + 1 jcc) vs (pandn + movmsk + fused test-and-branch)), but fewer instructions. It is more efficient if you're going to setcc or cmovcc instead of jcc, since those can't macro-fuse with test.
That makes a total of 6 uops (5 with AVX) for a reflexive compare-and-branch, vs. 3 uops for an IEEE compare-and-branch (cmpeqps / movmskps / test-and-branch).
PTEST has a very high latency on AMD Bulldozer-family CPUs (14c on Steamroller). They have one cluster of vector execution units shared by two integer cores. (This is their alternative to hyperthreading.) This increases the time until a branch mispredict can be detected, or the latency of a data-dependency chain (cmovcc / setcc).
PTEST sets ZF when 0==(xmm0 AND xmm2): set if no elements were both bitwise_equal AND IEEE (neq or unordered). i.e. ZF is unset if any element was bitwise_equal while also being !ieee_equal. This can only happen when a pair of elements contain bitwise-equal NaNs (but can happen when other elements are unequal).
movaps xmm2, xmm0
cmpneqps xmm2, xmm1 ; 0:A and B are ordered and equal.
pcmpeqd xmm0, xmm1 ; -1:bitwise equal
ptest xmm0, xmm2
jc equal_reflexive ; other cases
...
equal_reflexive:
setnz dl ; set if at least one both-nan element
There's no condition that tests CF=1 AND anything about ZF. ja tests CF=0 and ZF=0. It's unlikely that you'd only want to test that anyway, so putting a jnz in the jc branch target works fine. (And if you did only want to test equal_reflexive AND at_least_one_nan, a different setup could probably set flags appropriately.)
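To make the two PTEST flags easier to reason about, here is a scalar C model of what PTEST a, b computes, treating each 128-bit operand as four 32-bit words. The names ptest_flags and ptest_model are made up for this sketch; on real hardware you would use the instruction itself. With a holding the pcmpeqd result and b holding the cmpneqps result, CF=1 means reflexively equal, and ZF=0 means at least one pair of bitwise-identical NaNs.

```c
#include <stdint.h>

/* Hypothetical container for the two flags PTEST writes. */
typedef struct {
    int zf;  /* 1 iff (a AND b) is all-zero      */
    int cf;  /* 1 iff (NOT(a) AND b) is all-zero */
} ptest_flags;

/* Scalar model of PTEST a, b over 128 bits stored as four 32-bit words. */
ptest_flags ptest_model(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t and_bits = 0;   /* accumulates a AND b       */
    uint32_t andn_bits = 0;  /* accumulates NOT(a) AND b  */
    for (int i = 0; i < 4; i++) {
        and_bits |= a[i] & b[i];
        andn_bits |= ~a[i] & b[i];
    }
    ptest_flags f = { and_bits == 0, andn_bits == 0 };
    return f;
}
```

The model makes it visible why both flags carry independent information here: CF only looks at elements that are not bitwise-equal, ZF only at elements that are.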
This is the same idea as Paul R's answer, but with a bugfix (combine NaN check with IEEE check using AND rather than OR.)
; inputs in xmm0, xmm1
movaps xmm2, xmm0
cmpordps xmm2, xmm2 ; find NaNs in A. (0: NaN. -1: anything else). Same as cmpeqps since src and dest are the same.
movaps xmm3, xmm1
cmpordps xmm3, xmm3 ; find NaNs in B
orps xmm2, xmm3 ; 0:A and B are both NaN. -1:anything else
cmpneqps xmm0, xmm1 ; 0:IEEE equal (and ordered). -1:unequal or unordered
; xmm0 AND xmm2 is zero where elements are IEEE equal, or both NaN
; xmm0 xmm2
; 0 0 -> equal (ieee_equal and both NaN (impossible))
; 0 -1 -> equal (ieee_equal)
; -1 0 -> equal (both NaN)
; -1 -1 -> unequal (neither equality condition)
ptest xmm0, xmm2 ; ZF= 0 == (xmm0 AND xmm2). Set if no differences in any element
jz equal_reflexive
; else at least one element was unequal
; alternative to PTEST: andps xmm0, xmm2 / movmskps / test / jz
So in this case we don't need PTEST's CF result after all. We do when using PCMPEQD, because it doesn't have an inverse (the way cmpunordps has cmpordps).
That's 9 fused-domain uops on Intel SnB-family CPUs (7 with AVX: use non-destructive 3-operand instructions to avoid the movaps). However, pre-Skylake SnB-family CPUs can only run cmpps on p1, so this bottlenecks on the FP-add unit if throughput is a concern. Skylake runs cmpps on p0/p1.
andps has a shorter encoding than pand, and Intel CPUs from Nehalem to Broadwell can only run it on port5. That may be desirable to prevent it from stealing a p0 or p1 cycle from surrounding FP code. Otherwise pandn is probably a better choice. On AMD BD-family, andnps runs in the ivec domain anyway, so you don't avoid the bypass delay between int and FP vectors (which you might otherwise expect to manage if you use movmskps instead of ptest, in this version that only uses cmpps, not pcmpeqd). Also note that instruction ordering is chosen for human readability here. Putting the FP compare(A,B) earlier, before the ANDPS, might help the CPU get started on that a cycle sooner.
If one operand is reused, it should be possible to reuse its self-NaN-finding result. The new operand still needs its self-NaN check, and a compare against the reused operand, so we only save one movaps/cmpps.
If the vectors are in memory, at least one of them needs to be loaded with a separate load insn. The other one can just be referenced twice from memory. This sucks if it's unaligned or the addressing mode can't micro-fuse, but could be useful. If one of the operands to vcmpps is a vector known to not have any NaNs (e.g. a zeroed register), vcmpunord_qps xmm2, xmm15, [rsi] will find NaNs in [rsi].
If we don't want to use PTEST, we can get the same result by using the opposite comparisons, but combining them with the opposite logical operator (AND vs. OR).
; inputs in xmm0, xmm1
movaps xmm2, xmm0
cmpunordps xmm2, xmm2 ; find NaNs in A (-1:NaN 0:anything else)
movaps xmm3, xmm1
cmpunordps xmm3, xmm3 ; find NaNs in B
andps xmm2, xmm3 ; xmm2 = (-1:both NaN 0:anything else)
; now in the same boat as before: xmm2 is set for elements we want to consider equal, even though they're not IEEE equal
cmpeqps xmm0, xmm1 ; -1:ieee_equal 0:unordered or unequal
; xmm0 xmm2
; -1 0 -> equal (ieee_equal)
; -1 -1 -> equal (ieee_equal and both NaN (impossible))
; 0 0 -> unequal (neither)
; 0 -1 -> equal (both NaN)
orps xmm0, xmm2 ; 0: unequal. -1:reflexive_equal
movmskps eax, xmm0
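cmp eax, 0xF ; all four mask bits are set only if every element is reflexive_equal
je equal_reflexive ; (test / jnz would fire as soon as at least one element is equal)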
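The OR-combined any-NaN variant can be sketched with SSE2 intrinsics like this. The function name is hypothetical and the code assumes <emmintrin.h> is available. Note that with this polarity (set bit = equal) an all-elements test has to compare the movmskps result against 0xF; checking for non-zero would only tell you that at least one element matched.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE compare intrinsics too) */

/* Hypothetical helper: 1 if every element is IEEE-equal or both are NaN
   (any encoding), 0 if at least one element differs. */
int reflexive_equal_any_nan(__m128 a, __m128 b)
{
    __m128 a_nan = _mm_cmpunord_ps(a, a);         /* all-ones where A is NaN */
    __m128 b_nan = _mm_cmpunord_ps(b, b);         /* all-ones where B is NaN */
    __m128 both_nan = _mm_and_ps(a_nan, b_nan);   /* all-ones where both are NaN */
    __m128 ieee_equal = _mm_cmpeq_ps(a, b);       /* all-ones where ordered and equal */
    __m128 equal = _mm_or_ps(ieee_equal, both_nan);
    /* movmskps gives one bit per element; all four must be set. */
    return _mm_movemask_ps(equal) == 0xF;
}
```

Compared with the PTEST version, this trades the 2-uop ptest for orps plus movmskps and a compare against a constant, which may or may not be a win depending on the surrounding code.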
Other ideas: unfinished, non-viable, broken, or worse-than-the-above
The all-ones result of a true comparison is an encoding of NaN. (Try it out.) Perhaps we can avoid using POR or PAND to combine results from cmpps on each operand separately?
; inputs in A:xmm0 B:xmm1
movaps xmm2, xmm0
cmpordps xmm2, xmm2 ; find NaNs in A. (0: NaN. -1: anything else). Same as cmpeqps since src and dest are the same.
; cmpunordps wouldn't be useful: NaN stays NaN, while other values are zeroed. (This could be useful if ORPS didn't exist)
; integer -1 (all-ones) is a NaN encoding, but all-zeros is 0.0
cmpunordps xmm2, xmm1
; A:NaN B:0 -> 0 unord 0 -> false
; A:0 B:NaN -> NaN unord NaN -> true
; A:0 B:0 -> NaN unord 0 -> true
; A:NaN B:NaN -> 0 unord NaN -> true
; Desired: 0 where A and B are both NaN.
cmpordps xmm2, xmm1 just flips the final result for each case, with the "odd-man-out" still on the 1st row.
We can only get the result we want (true iff A and B are both NaN) if both inputs are inverted (NaN -> non-NaN and vice versa). This means we could use this idea for cmpordps as a replacement for pand after doing cmpordps self,self on both A and B. This isn't useful: even if we have AVX but not AVX2, we can use vandps and vandnps (and vmovmskps or vptest, both of which are plain AVX). Bitwise booleans are only single-cycle latency, and don't tie up the vector-FP-add execution port(s), which is already a bottleneck for this code.
I spent a while with the manual on vfixupimmps.
It can modify a destination element if a source element is NaN, but that can't be conditional on anything about the dest element.
I was hoping I could think of a way to vcmpneqps and then fix up that result, once with each source operand (to elide the boolean instructions that combine the results of 3 vcmpps instructions). I'm now fairly sure that's impossible, because knowing that one operand is NaN isn't enough by itself to make a change to the IEEE_equal(A,B) result.
I think the only way we could use vfixupimmps is for detecting NaNs in each source operand separately, like vcmpunord_qps but worse. Or as a really stupid replacement for andps, detecting either 0 or all-ones (NaN) in the mask results of previous compares.
Using AVX512 mask registers could help combine the results of compares. Most AVX512 compare instructions put the result into a mask register instead of a mask vector in a vector reg, so we actually have to do things this way if we want to operate in 512b chunks.
VFPCLASSPS k2 {k1}, xmm2, imm8 writes to a mask register, optionally masked by a different mask register. By setting only the QNaN and SNaN bits of the imm8, we can get a mask of where there are NaNs in a vector. By setting all the other bits, we can get the inverse.
By using the mask from A as a zero-mask for the vfpclassps on B, we can find the both-NaN positions with only 2 instructions, instead of the usual cmp/cmp/combine. So we save an or or andn instruction. Incidentally, I wonder why there's no OR-NOT operation. Probably it comes up even less often than AND-NOT, or they just didn't want porn in the instruction set.
Neither yasm nor nasm can assemble this, so I'm not even sure if I have the syntax correct!
; I think this works
; 0x81 = CLASS_QNAN|CLASS_SNAN (first and last bits of the imm8)
VFPCLASSPS k1, zmm0, 0x81 ; k1 = 1:NaN in A. 0:non-NaN
VFPCLASSPS k2{k1}, zmm1, 0x81 ; k2 = 1:NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
;; so k2 is like the bitwise-equal result from pcmpeqd: it's an override for ieee_equal
vcmpNEQ_UQps k3, zmm0, zmm1
;; k3= 0 only where IEEE equal (because of cmpneqps normal operation)
; k2 k3 ; same logic table as the pcmpeqd bitwise-NaN version
; 0 0 -> equal (ieee equal)
; 0 1 -> unequal (neither)
; 1 0 -> equal (ieee equal and both-NaN (impossible))
; 1 1 -> equal (both NaN)
; not(k2) AND k3 is true only when the element is unequal (bitwise and ieee)
KTESTW k2, k3 ; same as PTEST: set CF from 0 == (NOT(k2) AND k3)
jc .reflexive_equal
We could reuse the same mask register as both zeromask and destination for the 2nd vfpclassps insn, but I used different registers in case I wanted to distinguish between them in a comment. This code needs a minimum of two mask registers, but no extra vector registers. We could also use k0 instead of k3 as the destination for vcmpps, since we don't need to use it as a predicate, only as a dest and src. (k0 is the register that can't be used as a predicate, because that encoding instead means "no masking".)
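Since I couldn't assemble the AVX512 version, the mask logic can at least be sanity-checked with a scalar model. The sketch below stands in for vfpclassps (imm8 = 0x81, with a zero-mask), vcmpNEQ_UQps and the CF result of ktestw, using 16-bit integers as mask registers. Every function name here is hypothetical, and nothing in it is an AVX512 intrinsic.

```c
#include <math.h>    /* isnan, NAN */
#include <stdint.h>

/* Models vfpclassps k{zeromask}, v, 0x81: bit i set where v[i] is NaN,
   then zeroed wherever the zero-mask bit is clear. */
static uint16_t nan_mask(const float v[16], uint16_t zeromask)
{
    uint16_t k = 0;
    for (int i = 0; i < 16; i++)
        if (isnan(v[i]))
            k |= (uint16_t)(1u << i);
    return k & zeromask;
}

/* Models vcmpNEQ_UQps: bit i set where a[i] and b[i] are unequal or unordered. */
static uint16_t neq_uq_mask(const float a[16], const float b[16])
{
    uint16_t k = 0;
    for (int i = 0; i < 16; i++)
        if (!(a[i] == b[i]))
            k |= (uint16_t)(1u << i);
    return k;
}

/* 1 if all 16 elements are IEEE-equal or both NaN, else 0. */
int reflexive_equal_avx512_model(const float a[16], const float b[16])
{
    uint16_t k1 = nan_mask(a, 0xFFFF);  /* NaNs in A                         */
    uint16_t k2 = nan_mask(b, k1);      /* NaNs in both, thanks to zero-mask */
    uint16_t k3 = neq_uq_mask(a, b);    /* not IEEE-equal                    */
    /* ktestw k2, k3 sets CF when (NOT(k2) AND k3) == 0. */
    return (uint16_t)(~k2 & k3) == 0;
}
```

The model only checks the boolean logic; it says nothing about uop counts or whether an assembler accepts the syntax.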
I'm not sure we could create a single mask with the reflexive_equal result for each element, without a k... instruction to combine two masks at some point (e.g. kandnw instead of ktestw). Masks only work as zero-masks, not one-masks that can force a result to one, so combining the vfpclassps results only works as an AND. So I think we're stuck with 1-means-both-NaN, which is the wrong sense for using it as a zeromask with vcmpps. Doing vcmpps first, and then using the mask register as destination and predicate for vfpclassps, doesn't help either. Merge-masking instead of zero-masking would do the trick, but isn't available when writing to a mask register.
;;; Demonstrate that it's hard (probably impossible) to avoid using any k... instructions
vcmpneq_uqps k1, zmm0, zmm1 ; 0:ieee equal 1:unequal or unordered
vfpclassps k2{k1}, zmm0, 0x81 ; 0:ieee equal or A is NaN. 1:unequal
vfpclassps k2{k2}, zmm1, 0x81 ; 0:ieee equal | A is NaN | B is NaN. 1:unequal
;; This is just a slow way to do vcmpneq_Oqps: ordered and unequal.
vfpclassps k3{k1}, zmm0, ~0x81 ; 0:ieee equal or A is not NaN. 1:unequal and A is NaN
vfpclassps k3{k3}, zmm1, ~0x81 ; 0:ieee equal | A is not NaN | B is not NaN. 1:unequal & A is NaN & B is NaN
;; nope, mixes the conditions the wrong way.
;; The bits that remain set don't have any information from vcmpneqps left: both-NaN is always ieee-unequal.
If ktest ends up being 2 uops like ptest, and can't macro-fuse, then kmov eax, k2 / test-and-branch will probably be cheaper than ktest k1,k2 / jcc. Hopefully it will only be one uop, since mask registers are more like integer registers, and can be designed from the start to be internally "close" to the flags. ptest was only added in SSE4.1, after many generations of designs with no interaction between vectors and EFLAGS.
kmov does set you up for popcnt, bsf or bsr, though. (bsf/jcc doesn't macro-fuse, so in a search loop you're probably still going to want to test/jcc and only bsf when a non-zero is found. The extra byte to encode tzcnt doesn't buy you anything unless you're doing something branchless, because bsf still sets ZF on a zero input, even though the dest register is undefined. lzcnt gives 31 - bsr for a 32-bit operand, though, so it can be useful even when you know the input is non-zero.)
We can also use vcmpEQps and combine our results differently:
VFPCLASSPS k1, zmm0, 0x81 ; k1 = set where there are NaNs in A
VFPCLASSPS k2{k1}, zmm1, 0x81 ; k2 = set where there are NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
vcmpEQ_OQps k3, zmm0, zmm1
;; k3= 1 only where IEEE equal and ordered (cmpeqps normal operation)
; k3 k2
; 1 0 -> equal (ieee equal)
; 1 1 -> equal (ieee equal and both-NaN (impossible))
; 0 0 -> unequal (neither)
; 0 1 -> equal (both NaN)
KORTESTW k3, k2 ; CF = set iff k3|k2 is all-ones.
jc .reflexive_equal
This way only works when there's a size of kortest that exactly matches the number of elements in our vectors. e.g. a 256b vector of double-precision elements only has 4 elements, but kortestb still sets CF according to the low 8 bits of the input mask registers.
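The size mismatch is easy to see in a scalar model of kortestb's CF result. In the sketch below (hypothetical names, plain C, no AVX512 involved), a 4-element compare can never produce CF=1 on its own because mask bits 4..7 stay zero; one possible workaround, assumed here for illustration, is to force the unused bits to one first, which would cost an extra mask instruction in real code.

```c
#include <stdint.h>

/* Models the CF result of kortestb k1, k2: 1 iff (k1 OR k2) has all 8 bits set. */
int kortestb_cf(uint8_t k1, uint8_t k2)
{
    return (uint8_t)(k1 | k2) == 0xFF;
}

/* Hypothetical workaround for vectors with fewer than 8 elements:
   set the mask bits above the live elements before testing. */
int all_lanes_set(uint8_t k1, uint8_t k2, int lanes)
{
    uint8_t unused = (uint8_t)(0xFFu << lanes);  /* bits above the live elements */
    return kortestb_cf((uint8_t)(k1 | unused), k2);
}
```

Alternatively, kmov to an integer register followed by a compare against the expected constant avoids the issue entirely.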
Other than NaN, +/-0 is the only time when IEEE_equal is different from bitwise_equal. (Unless I'm missing something. Double-check this assumption before using!) +0 and -0 have all their bits zero, except that -0 has the sign bit set (the MSB).
If we ignore different NaN encodings, then bitwise_equal is the result we want, except in the +/- 0 case. A OR B will be 0 everywhere except the sign bit iff A and B are +/- 0. A left-shift by one makes it all-zero or not-all-zero depending on whether or not we need to override the bitwise-equal test.
This uses one more instruction than cmpneqps, because we're emulating the functionality we need from it with por / paddD. (Or pslld by one, but that's one byte longer. It does run on a different port than pcmpeq, but you need to consider the port distribution of the surrounding code to factor that into the decision.)
This algorithm might be useful on different SIMD architectures that don't provide the same vector FP tests for detecting NaN.
;inputs in xmm0:A xmm1:B
movaps xmm2, xmm0
pcmpeqd xmm2, xmm1 ; xmm2=bitwise_equal. (0:unequal -1:equal)
por xmm0, xmm1
paddD xmm0, xmm0 ; left-shift by 1 (one byte shorter than pslld xmm0, 1, and can run on more ports).
; xmm0=all-zero only in the +/- 0 case (where A and B are IEEE equal)
; xmm2 xmm0 desired result (0 means "no difference found")
; -1 0 -> 0 ; bitwise equal and +/-0 equal
; -1 non-zero -> 0 ; just bitwise equal
; 0 0 -> 0 ; just +/-0 equal
; 0 non-zero -> non-zero ; neither
ptest xmm2, xmm0 ; CF = ( (not(xmm2) AND xmm0) == 0)
jc reflexive_equal
The latency is lower than the cmpneqps version above, by one or two cycles.
We're really taking full advantage of PTEST here: using its ANDN between two different operands, and using its compare-against-zero of the whole thing. We can't replace it with pandn / movmskps because we need to check all the bits, not just the sign bit of each element.
I haven't actually tested this, so it might be wrong even if my conclusion is right that +/-0 is the only time IEEE_equal is different from bitwise_equal (other than NaNs).
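One way to try the integer-only idea without SSE4.1 is the SSE2 sketch below. The function name is hypothetical. Since PTEST isn't available in SSE2, the all-bits-zero check is done with pcmpeqb against zero plus pmovmskb, which is a stand-in I chose for testing the logic, not a claim about the fastest sequence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1 if every element is bitwise-equal or both are +/-0.
   Different NaN encodings are NOT considered equal by this version. */
int equal_bitwise_or_zero(__m128 a, __m128 b)
{
    __m128i ai = _mm_castps_si128(a);
    __m128i bi = _mm_castps_si128(b);
    __m128i bitwise_equal = _mm_cmpeq_epi32(ai, bi);     /* pcmpeqd */
    __m128i either = _mm_or_si128(ai, bi);               /* por */
    /* paddd x, x shifts each element left by one, discarding the sign bit:
       the result is zero only when both inputs were +0 or -0. */
    __m128i no_sign = _mm_add_epi32(either, either);
    /* NOT(bitwise_equal) AND no_sign: non-zero where an element really differs. */
    __m128i diff = _mm_andnot_si128(bitwise_equal, no_sign);
    /* SSE2 substitute for PTEST's CF: check that every byte of diff is zero. */
    __m128i is_zero = _mm_cmpeq_epi8(diff, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) == 0xFFFF;
}
```

Checking every byte matters because the interesting bits of diff can be anywhere in the element, not just in the sign position that movmskps would look at.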
Handling non-bitwise-identical NaNs with integer-only ops is probably not worth it. The encoding is so similar to +/-Inf that I can't think of any simple checks that wouldn't take several instructions. Inf has all the exponent bits set, and an all-zero mantissa. NaN has all the exponent bits set, with a non-zero mantissa aka significand (so there are 23 bits of payload). The MSB of the mantissa is interpreted as an is_quiet flag to distinguish signalling / quiet NaNs. Also see Intel manual vol1, table 4-3 (Floating-Point Number and NaN Encodings).
If it wasn't for -Inf using the top-9-bits-set encoding, we could check for NaN with an unsigned compare for A > 0x7f800000. (0x7f800000 is single-precision +Inf.) However, note that pcmpgtd / pcmpgtq are signed integer compares. AVX512F VPCMPUD is an unsigned compare (dest = a mask register).
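One way around both the -Inf problem and the signed-compare problem, at the cost of an extra AND, is to clear the sign bit before comparing: every value is then non-negative as an integer, so signed and unsigned compares agree. A scalar sketch in C (both function names are hypothetical; a vector version would presumably be pand + pcmpgtd, which I haven't benchmarked):

```c
#include <stdint.h>
#include <string.h>  /* memcpy */

/* Hypothetical helper: reinterpret a 32-bit pattern as a float. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* 1 if f is any NaN encoding (quiet or signalling, either sign), else 0. */
int is_nan_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits &= 0x7fffffffu;        /* fold -Inf and negative NaNs onto the positive side */
    return bits > 0x7f800000u;  /* above +Inf means exponent all-ones, mantissa non-zero */
}
```

This is exactly the kind of "several instructions" overhead mentioned above, so it mostly confirms that cmpunordps is the better tool when it's available.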
The OP's suggestion of !(a<b) && !(b<a) can't work, and neither can any variation of it. You can't tell the difference between one NaN and two NaNs just from two compares with reversed operands. Even mixing predicates can't help: no VCMPPS predicate differentiates one operand being NaN from both operands being NaN, or depends on whether it's the first or second operand that's NaN. Thus, it's impossible for a combination of them to have that information.
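The false positive is easy to reproduce in scalar C. In this sketch (function names made up for illustration), the predicate reports "equal" whenever either operand is NaN, because both less-than comparisons are false for unordered inputs:

```c
#include <math.h>  /* NAN */

/* The OP's per-element predicate: neither value is less than the other. */
int not_less_either_way(float a, float b)
{
    return !(a < b) && !(b < a);
}

/* Returns 1 if the predicate wrongly treats NaN as equal to an ordinary number. */
int has_false_positive(void)
{
    return not_less_either_way(NAN, 1.0f);
}
```

So the predicate gets NaN == NaN right only by accident: it can't separate that case from NaN vs. 1.0.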
Paul R's solution of comparing a vector with itself does let us detect where there are NaNs and handle them "manually". No combination of results from VCMPPS between the two operands is sufficient, but using operands other than A and B does help. (Either a known-non-NaN vector or the same operand twice.)
Without the inversion, the bitwise-NaN code finds when at least one element is equal. (There's no inverse for pcmpeqd, so we can't use different logical operators and still get a test for all-equal):
; inputs in xmm0, xmm1
movaps xmm2, xmm0
cmpeqps xmm2, xmm1 ; -1:ieee_equal. EQ_OQ predicate in the expanded notation for VEX encoding
pcmpeqd xmm0, xmm1 ; -1:bitwise equal
orps xmm0, xmm2
; xmm0 = -1:(where an element is bitwise or ieee equal) 0:elsewhere
movmskps eax, xmm0
test eax, eax
jnz at_least_one_equal
; else all different
PTEST isn't useful this way, since combining with OR is the only useful thing.
// UNFINISHED start of an idea
bitdiff = _mm_xor_si128(A, B);
signbitdiff = _mm_srai_epi32(bitdiff, 31); // broadcast the diff in sign bit to the whole vector
signbitdiff = _mm_srli_epi32(bitdiff, 1); // zero the sign bit
something = _mm_and_si128(bitdiff, signbitdiff);