Problem description
I can't figure out how to implement:
__m256d min(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d result;
    // result should contain the 4 minimal values out of the 16: A[0], A[1], A[2], A[3], B[0], ..., D[3]
    // moreover it should satisfy result[0] <= result[1] <= result[2] <= result[3]
    return result;
}
Any ideas of how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way?
==================================================
This is where I got so far. After:
__m256d T = _mm256_min_pd(A, B);
__m256d Q = _mm256_max_pd(A, B);
A = T; B = Q;
T = _mm256_min_pd(C, D);
Q = _mm256_max_pd(C, D);
C = T; D = Q;
T = _mm256_min_pd(B, C);
Q = _mm256_max_pd(B, C);
B = T; C = Q;
T = _mm256_min_pd(A, B);
Q = _mm256_max_pd(A, B);
A = T; B = Q;
T = _mm256_min_pd(C, D);
Q = _mm256_max_pd(C, D);
C = T; D = Q;
T = _mm256_min_pd(B, C);
Q = _mm256_max_pd(B, C);
B = T; C = Q;
we have: A[0] < B[0] < C[0] < D[0], A[1] < B[1] < C[1] < D[1], A[2] < B[2] < C[2] < D[2], A[3] < B[3] < C[3] < D[3],
so the minimal value is among the A's, the second minimal is among the A's or B's, ... Not sure where to go from there ...
========================================================
Second idea is that the problem is reducible to itself, but with 2 input __m256d vectors. If this can be done, then just do min4(A,B) --> P, min4(C,D) --> Q, min4(P,Q) --> return value.
No idea how to do that for two vectors though :)
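For illustration, a minimal sketch of that reduction, assuming a hypothetical helper min4_2() (the name is made up here) that returns the 4 smallest of the 8 doubles in its two arguments; that helper is exactly the missing piece:

#include <immintrin.h>

// Hypothetical building block: the 4 smallest of the 8 doubles in {X, Y},
// in any order. This is the part that still has to be worked out.
__m256d min4_2(__m256d X, __m256d Y);

// If min4_2 existed, the 16-input problem would reduce to three calls:
__m256d min16(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d P = min4_2(A, B);   // 4 smallest of A and B
    __m256d Q = min4_2(C, D);   // 4 smallest of C and D
    return min4_2(P, Q);        // 4 smallest of all 16
}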
=======================================================================
Update 2: problem almost solved -- the following function computes the 4 minimal values.
__m256d min4(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d T;

    // Reduce A,B: after each min/max, shuffle the maxes so every element meets
    // new partners, and repeat. A ends up holding the 4 smallest of A and B.
    T = _mm256_min_pd(A, B);
    B = _mm256_max_pd(A, B);
    B = _mm256_permute_pd(B, 0x5);           // swap within each 128-bit lane
    A = _mm256_min_pd(T, B);
    B = _mm256_max_pd(T, B);
    B = _mm256_permute2f128_pd(B, B, 0x1);   // swap the two 128-bit lanes
    T = _mm256_min_pd(A, B);
    B = _mm256_max_pd(A, B);
    B = _mm256_permute_pd(B, 0x5);
    A = _mm256_min_pd(T, B);

    // Same reduction for C,D: C ends up holding the 4 smallest of C and D.
    T = _mm256_min_pd(C, D);
    D = _mm256_max_pd(C, D);
    D = _mm256_permute_pd(D, 0x5);
    C = _mm256_min_pd(T, D);
    D = _mm256_max_pd(T, D);
    D = _mm256_permute2f128_pd(D, D, 0x1);
    T = _mm256_min_pd(C, D);
    D = _mm256_max_pd(C, D);
    D = _mm256_permute_pd(D, 0x5);
    C = _mm256_min_pd(T, D);

    // Final reduction of the two survivors: A holds the 4 smallest of all 16.
    T = _mm256_min_pd(A, C);
    C = _mm256_max_pd(A, C);
    C = _mm256_permute_pd(C, 0x5);
    A = _mm256_min_pd(T, C);
    C = _mm256_max_pd(T, C);
    C = _mm256_permute2f128_pd(C, C, 0x1);
    T = _mm256_min_pd(A, C);
    C = _mm256_max_pd(A, C);
    C = _mm256_permute_pd(C, 0x5);
    A = _mm256_min_pd(T, C);

    return A;
}
All that remains is to sort the values inside A into increasing order before returning.
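One possible way to do that last step (a sketch of my own, not from the original post; AVX only): a standard 4-element sorting network with compare-exchanges (0,1) and (2,3), then (0,2) and (1,3), then (1,2), built from min/max, permutes, and immediate blends.

#include <immintrin.h>

// Sketch: sort the 4 doubles inside one __m256d into ascending order.
static inline __m256d sort4_pd(__m256d v)
{
    // Stage 1: compare-exchange (0,1) and (2,3), both within a 128-bit lane.
    __m256d sw = _mm256_permute_pd(v, 0x5);              // [v1, v0, v3, v2]
    __m256d lo = _mm256_min_pd(v, sw);
    __m256d hi = _mm256_max_pd(v, sw);
    v = _mm256_blend_pd(lo, hi, 0xA);                    // [min01, max01, min23, max23]

    // Stage 2: compare-exchange (0,2) and (1,3), across the two lanes.
    sw = _mm256_permute2f128_pd(v, v, 0x1);              // [v2, v3, v0, v1]
    lo = _mm256_min_pd(v, sw);
    hi = _mm256_max_pd(v, sw);
    v = _mm256_blend_pd(lo, hi, 0xC);                    // [min02, min13, max02, max13]

    // Stage 3: only (1,2) is still unresolved; comparing against the full
    // reversal fixes it while leaving elements 0 and 3 (already the overall
    // min and max) unchanged.
    sw = _mm256_permute_pd(_mm256_permute2f128_pd(v, v, 0x1), 0x5);  // [v3, v2, v1, v0]
    lo = _mm256_min_pd(v, sw);
    hi = _mm256_max_pd(v, sw);
    return _mm256_blend_pd(lo, hi, 0xC);                 // ascending order
}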
Recommended answer
It might be best to do some SIMD compares to reduce to 8 or 4 candidates (like you have now), and then unpack to scalar doubles in vector registers. This doesn't have to involve a memory round-trip: vextractf128 the high half (_mm256_extractf128_pd), and cast the low half. Maybe use movhlps (_mm_movehl_ps) to get the high half of a __m128 down to the low half (although on CPUs with AVX, you only save a code byte or two from using that instead of a shuffle with an immediate; it's not faster like it is on some old CPUs).
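As an illustration of the in-register unpack (my own sketch with made-up names; it uses unpackhi rather than the movhlps trick mentioned above):

#include <immintrin.h>

// Pull the 4 doubles of a __m256d out as scalars while staying in registers;
// no store/reload of the vector is needed.
static inline void unpack4_pd(__m256d v, double *d0, double *d1,
                              double *d2, double *d3)
{
    __m128d lo = _mm256_castpd256_pd128(v);    // elements 0,1 (just a cast, no instruction)
    __m128d hi = _mm256_extractf128_pd(v, 1);  // elements 2,3 (vextractf128)
    *d0 = _mm_cvtsd_f64(lo);                   // element 0 is already the low scalar
    *d1 = _mm_cvtsd_f64(_mm_unpackhi_pd(lo, lo));
    *d2 = _mm_cvtsd_f64(hi);
    *d3 = _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi));
}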
IDK whether unpacking with shuffles or simply storing is the way to go. Maybe a mix of both, to keep the shuffle ports and the store/load ports busy, would do well. Obviously the low double in every vector is already in place as a scalar, so that's one you don't have to load. (And compilers are bad at seeing through a store-and-reload-as-scalars to take advantage of this, even to a local array.)
Even without narrowing down the set of candidates very much, some SIMD comparators before unpacking could reduce the number of swaps / shuffles expected from branchy scalar code, reducing branch mispredict penalties.
As I described in comments on Paul R's answer, in scalar code you'd probably do well with an insertion-sort type of algorithm. But it's more like a priority queue: only insert into the first 4 elements. If a new candidate is greater than the biggest existing candidate, just move on. Otherwise insertion-sort it into the list of 4 candidates which you maintain in sorted order.
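A scalar sketch of that bounded-insertion idea (my own illustration; the names and the fixed size of 4 are assumptions, not code from this answer):

// Keep best[0..3] sorted ascending. A candidate that is not smaller than the
// current largest is rejected immediately; otherwise it is insertion-sorted
// in and the old largest falls off the end.
static inline void push_candidate(double best[4], double x)
{
    if (x >= best[3])
        return;                       // not among the 4 smallest seen so far
    int i = 3;
    while (i > 0 && best[i - 1] > x)  // shift larger entries up one slot
    {
        best[i] = best[i - 1];
        --i;
    }
    best[i] = x;
}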
I found a really nice paper on SIMD sorting networks, with specific discussion of AVX. They go into detail about the required shuffles when using SIMD packed-min / packed-max instructions to sort a couple of vector registers of data. They even use intrinsics like _mm512_shuffle_epi32 in their examples. They say their results are applicable to AVX, even though they use AVX-512 mask registers in their examples.
It's only the last bit of the paper where they talk about merging to use the small sort as a building block for a big parallel sort. I can't find their actual code anywhere, so maybe they never published the full implementation they benchmarked to make their graphs. :(
BTW, I wrote a previous answer with some not-very-great ideas about sorting 64bit structs by a float member, but that's not really applicable here since I was only addressing the complications of dealing with a payload (which you don't have).
I don't have time right now to finish this answer, so I'll just post a summary of my idea:
Adapt the 2-register method from that paper to AVX (or AVX2). I'm not sure how best to emulate their AVX512 masked min/max instructions. :/ I may update this later. You might want to email the authors and ask about the code they used to benchmark the desktop CPU.
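One possible emulation for the special case of a compile-time-constant mask (an assumption on my part, not something from the paper): an AVX-512 masked min such as _mm512_mask_min_pd(src, k, a, b) keeps src where the mask bit is clear, which an unmasked min followed by an immediate blend can reproduce.

#include <immintrin.h>

// Sketch: masked min for a constant 4-bit element mask. Elements whose mask
// bit is set get min(a, b); the rest keep src.
#define MASKED_MIN_PD(src, a, b, mask) \
    _mm256_blend_pd((src), _mm256_min_pd((a), (b)), (mask))

// Example: update only elements 1 and 3 of acc with the element-wise min.
// __m256d acc2 = MASKED_MIN_PD(acc, x, y, 0xA);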
Anyway, use the 2-register function on pairs of regs, to reduce from 4 to 2 regs, then again to reduce to 1 reg. Unlike your version, theirs produces a fully-sorted output register.
Trying to avoid cross-lane shuffles whenever possible may be tricky. I'm not sure if you can gain anything from using shufpd (__m256d _mm256_shuffle_pd(__m256d a, __m256d b, const int select)) to combine data from two source regs while shuffling. The 256b version can do a different shuffle on each lane, using 4 bits of the imm8 instead of just 2.
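To make that 4-bit immediate concrete (my own illustration of the documented behaviour, not code from this answer): each output element takes the low or high double of its own 128-bit lane, alternating between the two sources.

#include <immintrin.h>

// vshufpd ymm: result elements 0 and 2 come from a, elements 1 and 3 from b;
// imm8 bit i selects the high (1) or low (0) double of that element's lane.
// With imm8 = 0x6 (binary 0110):
//   r[0] = a[0]   (bit0 = 0: low of a's lane 0)
//   r[1] = b[1]   (bit1 = 1: high of b's lane 0)
//   r[2] = a[3]   (bit2 = 1: high of a's lane 1)
//   r[3] = b[2]   (bit3 = 0: low of b's lane 1)
__m256d example_shufpd(__m256d a, __m256d b)
{
    return _mm256_shuffle_pd(a, b, 0x6);
}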
It's an interesting problem, but I unfortunately shouldn't take the time to write up a full solution myself. If I had time, I'd like to compare an insertion-sort priority queue and a sorting-network fully-unrolled implementation of the same pqueue, on 4, 8, 12, and 16 elements each. (different levels of SIMD sorting network before going scalar).
Links I found:
- Another paper on SSE sorting, with some code using palignr to merge two individually sorted vectors into a sorted 8-element pair of vectors. Not directly applicable to 256b vectors of doubles.
- Another paper on SSE sorting, with test results from Core2 CPUs. It uses an inefficient blend/blend/shuffle for shuffling between two vectors, because of the limitations of shufps. In-lane shufpd has slightly different restrictions. This paper is probably worth looking at carefully. They have algorithms that work on real SSE vectors, with the available shuffle operations.
- Xiaochen, Rocki, and Suda's paper, linked already.
- What looks like a chapter on sorting networks from the CLR algorithms textbook.
- Sorting network generator, with a choice of algorithms. Not SIMD-specific.
- https://en.wikipedia.org/wiki/Bitonic_sorter. The example diagram has a 16-input sorting network. Bitonic sorts use patterns that can SIMD to some degree. The upper part of the end of the network can be omitted, since we only care about the order of the lowest 4.
- Fastest sort of fixed length 6 int array. A popular question with many answers.