Question
I am learning and playing with SIMD functions and wrote a simple program that compares the number of vector-addition instructions it can run in 1 second against normal scalar addition. I found that SIMD performs relatively better at a lower optimization level and consistently much worse at higher optimization levels, and I want to know the reason. I used both MSVC and gcc; it is the same story. The following result is from a Ryzen 7 CPU. I also tested on an Intel platform, and it is pretty much the same story too.
#include <iostream>
#include <numeric>
#include <chrono>
#include <iterator>
#include <thread>
#include <atomic>
#include <vector>
#include <cstdint>
#include <immintrin.h>
int main()
{
    const auto threadLimit = std::thread::hardware_concurrency() - 1; // for running main()
    for (auto i = 1; i <= threadLimit; ++i)
    {
        std::cerr << "Testing " << i << " threads: ";
        std::atomic<unsigned long long> sumScalar{};
        std::atomic<unsigned long long> loopScalar{};
        std::atomic<unsigned long long> sumSimd{};
        std::atomic<unsigned long long> loopSimd{};
        std::atomic_bool stopFlag{ false };
        std::vector<std::thread> threads;
        threads.reserve(i);
        {
            // scalar test: each thread increments a 32-bit counter until stopFlag is set
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                {
                    uint32_t local{};
                    uint32_t loop{};
                    while (!stopFlag)
                    {
                        ++local;
                        ++loop; // removed this (see EDIT)
                    }
                    sumScalar += local;
                    loopScalar += loop;
                });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        threads.clear();
        stopFlag = false;
        {
            // SIMD test: each thread adds set1(1) to a __m256i accumulator until stopFlag is set
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                {
                    const auto oneVec = _mm256_set1_epi32(1);
                    auto local = _mm256_set1_epi32(0);
                    uint32_t inc{};
                    while (!stopFlag)
                    {
                        local = _mm256_add_epi32(oneVec, local);
                        ++inc; // removed this (see EDIT)
                    }
                    sumSimd += std::accumulate(reinterpret_cast<uint32_t*>(&local), reinterpret_cast<uint32_t*>(&local) + 8, uint64_t{});
                    loopSimd += inc;
                });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        std::cout << "Sum: " << sumSimd << " / " << sumScalar << "(" << 100.0 * sumSimd / sumScalar << "%)\t"
                  << "Loop: " << loopSimd << " / " << loopScalar << "(" << 100.0 * loopSimd / loopScalar << "%)\n";
        // SIMD/Scalar, higher value means SIMD better
    }
}
With g++ -O0 -march=native -lpthread, I got:
Testing 1 threads: Sum: 1004405568 / 174344207(576.105%) Loop: 125550696 / 174344207(72.0131%)
Testing 2 threads: Sum: 2001473960 / 348079929(575.004%) Loop: 250184245 / 348079929(71.8755%)
Testing 3 threads: Sum: 2991335152 / 521830834(573.238%) Loop: 373916894 / 521830834(71.6548%)
Testing 4 threads: Sum: 3892119680 / 693704725(561.063%) Loop: 486514960 / 693704725(70.1329%)
Testing 5 threads: Sum: 4957263080 / 802362140(617.834%) Loop: 619657885 / 802362140(77.2292%)
Testing 6 threads: Sum: 5417700112 / 953587414(568.139%) Loop: 677212514 / 953587414(71.0174%)
Testing 7 threads: Sum: 6078496824 / 1067533241(569.396%) Loop: 759812103 / 1067533241(71.1746%)
Testing 8 threads: Sum: 6679841000 / 1196224828(558.41%) Loop: 834980125 / 1196224828(69.8013%)
Testing 9 threads: Sum: 7396623960 / 1308004474(565.489%) Loop: 924577995 / 1308004474(70.6861%)
Testing 10 threads: Sum: 8158849904 / 1416026963(576.179%) Loop: 1019856238 / 1416026963(72.0224%)
Testing 11 threads: Sum: 8868695984 / 1556964234(569.615%) Loop: 1108586998 / 1556964234(71.2018%)
Testing 12 threads: Sum: 9441092968 / 1655554694(570.268%) Loop: 1180136621 / 1655554694(71.2835%)
Testing 13 threads: Sum: 9530295080 / 1689916907(563.951%) Loop: 1191286885 / 1689916907(70.4938%)
Testing 14 threads: Sum: 10444142536 / 1805583762(578.436%) Loop: 1305517817 / 1805583762(72.3045%)
Testing 15 threads: Sum: 10834255144 / 1926575218(562.358%) Loop: 1354281893 / 1926575218(70.2948%)
With g++ -O3 -march=native -lpthread, I got:
Testing 1 threads: Sum: 2933270968 / 3112671000(94.2365%) Loop: 366658871 / 3112671000(11.7796%)
Testing 2 threads: Sum: 5839842040 / 6177278029(94.5375%) Loop: 729980255 / 6177278029(11.8172%)
Testing 3 threads: Sum: 8775103584 / 9219587924(95.1789%) Loop: 1096887948 / 9219587924(11.8974%)
Testing 4 threads: Sum: 11350253944 / 10210948580(111.158%) Loop: 1418781743 / 10210948580(13.8947%)
Testing 5 threads: Sum: 14487451488 / 14623220822(99.0715%) Loop: 1810931436 / 14623220822(12.3839%)
Testing 6 threads: Sum: 17141556576 / 14437058094(118.733%) Loop: 2142694572 / 14437058094(14.8416%)
Testing 7 threads: Sum: 19883362288 / 18313186637(108.574%) Loop: 2485420286 / 18313186637(13.5718%)
Testing 8 threads: Sum: 22574437968 / 17115166001(131.897%) Loop: 2821804746 / 17115166001(16.4872%)
Testing 9 threads: Sum: 25356792368 / 18332200070(138.318%) Loop: 3169599046 / 18332200070(17.2898%)
Testing 10 threads: Sum: 28079398984 / 20747150935(135.341%) Loop: 3509924873 / 20747150935(16.9176%)
Testing 11 threads: Sum: 30783433560 / 21801526415(141.199%) Loop: 3847929195 / 21801526415(17.6498%)
Testing 12 threads: Sum: 33420443880 / 22794998080(146.613%) Loop: 4177555485 / 22794998080(18.3266%)
Testing 13 threads: Sum: 35989535640 / 23596768252(152.519%) Loop: 4498691955 / 23596768252(19.0649%)
Testing 14 threads: Sum: 38647578408 / 23796083111(162.412%) Loop: 4830947301 / 23796083111(20.3014%)
Testing 15 threads: Sum: 41148330392 / 24252804239(169.664%) Loop: 5143541299 / 24252804239(21.208%)
After removing the loop variable and leaving just local in both cases (see EDIT in the code), the result is still the same.
The results above are with GCC 9.3 on Ubuntu. I switched to GCC 10.2 on Windows (MinGW), and it shows nice scaling, see below (the result is from the original code). Can I pretty much conclude that it's a problem with MSVC and older GCC versions?
Testing 1 threads: Sum: 23752640416 / 3153263747(753.272%) Loop: 2969080052 / 3153263747(94.159%)
Testing 2 threads: Sum: 46533874656 / 6012052456(774.01%) Loop: 5816734332 / 6012052456(96.7512%)
Testing 3 threads: Sum: 66076900784 / 9260324764(713.548%) Loop: 8259612598 / 9260324764(89.1936%)
Testing 4 threads: Sum: 92216030528 / 12229625883(754.038%) Loop: 11527003816 / 12229625883(94.2548%)
Testing 5 threads: Sum: 111822357864 / 14439219677(774.435%) Loop: 13977794733 / 14439219677(96.8044%)
Testing 6 threads: Sum: 122858189272 / 17693796489(694.357%) Loop: 15357273659 / 17693796489(86.7947%)
Testing 7 threads: Sum: 148478021656 / 19618236169(756.837%) Loop: 18559752707 / 19618236169(94.6046%)
Testing 8 threads: Sum: 156931719736 / 19770409566(793.771%) Loop: 19616464967 / 19770409566(99.2213%)
Testing 9 threads: Sum: 143331726552 / 20753115024(690.652%) Loop: 17916465819 / 20753115024(86.3315%)
Testing 10 threads: Sum: 143541178880 / 20331801415(705.993%) Loop: 17942647360 / 20331801415(88.2492%)
Testing 11 threads: Sum: 160425817888 / 22209102603(722.343%) Loop: 20053227236 / 22209102603(90.2928%)
Testing 12 threads: Sum: 157095281392 / 23178532051(677.762%) Loop: 19636910174 / 23178532051(84.7202%)
Testing 13 threads: Sum: 156015224880 / 23818567634(655.015%) Loop: 19501903110 / 23818567634(81.8769%)
Testing 14 threads: Sum: 145464754912 / 23950304389(607.361%) Loop: 18183094364 / 23950304389(75.9201%)
Testing 15 threads: Sum: 149279587872 / 23585183977(632.938%) Loop: 18659948484 / 23585183977(79.1172%)
Accepted answer
reinterpret_cast<uint32_t*>(&local) after the loop is getting GCC9 to store/reload local inside the loop, creating a store-forwarding bottleneck.
This is already fixed in GCC10; no need to file a missed-optimization bug. Don't cast pointers onto __m256i locals; it also violates strict-aliasing, so it's Undefined Behaviour without -fno-strict-aliasing, even though GCC often makes it work. (You can point __m256i* at any other type, but not vice versa.)
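For illustration, a minimal sketch of the allowed direction (my own hypothetical helper, not code from the answer's Godbolt link): store the vector into an ordinary uint32_t array and sum that, so the only cast is uint32_t* to __m256i*, which is the direction the intrinsics permit.

#include <immintrin.h>
#include <cstdint>
#include <numeric>

// Sketch: strict-aliasing-safe horizontal sum by storing to a plain array first.
static uint64_t hsum_via_store(__m256i v)
{
    alignas(32) uint32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), v); // __m256i* may point at anything
    return std::accumulate(lanes, lanes + 8, uint64_t{});     // widen to 64-bit while summing
}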
gcc9.3 (which you're using) is storing/reloading your vector inside the loop, but keeping the scalar in a register for inc eax!
The vector loop thus bottlenecks on the latency of vector store-forwarding plus vpaddd, and that happens to be just over 8x slower than the scalar loop. Their bottlenecks are unrelated; being close to 1x total speed is just coincidence. (Each vector iteration adds 8 elements, so running at a bit over 1/8 the iteration rate gives a total element rate just under 1x scalar, hence the ~94% "Sum" ratio at -O3 with one thread.)
(The scalar loop presumably runs at 1 cycle per iteration on Zen1 or Skylake, and 7 cycles of store-forwarding plus 1 for vpaddd sounds about right.)
It's indirectly caused by reinterpret_cast<uint32_t*>(&local), either because GCC tries to be forgiving of the strict-aliasing undefined-behaviour violation, or just because you're taking a pointer to the local at all.
This is not normal or expected, but the combination of the atomic load inside the inner loop and maybe the lambda confuses GCC9 into making this mistake. (Note that GCC9 and GCC10 are reloading the address of stopFlag from the thread-function arg inside the loop, even for scalar, so there's already some failure to keep things in registers.)
In normal use-cases you'll be doing more SIMD work per check of a stop flag, and often you wouldn't even be keeping vector state across iterations. And usually you'll have a non-atomic arg that tells you how much work to do, not a stop flag you check inside the inner loop. So this missed-optimization bug is rarely a problem. (Unless it happens even without an atomic flag?)
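To make that concrete, here is a rough sketch of the more typical shape (hypothetical names, not code from the question): touch the atomic flag only once per sizeable chunk of SIMD work, so the flag check stops dominating the loop.

#include <immintrin.h>
#include <atomic>
#include <cstdint>

// Sketch: check the stop flag once per chunk instead of once per vector add.
void simdWorker(const std::atomic_bool& stopFlag, std::atomic<uint64_t>& chunksDone)
{
    const auto one = _mm256_set1_epi32(1);
    auto local = _mm256_set1_epi32(0);
    uint64_t chunks = 0;
    while (!stopFlag.load(std::memory_order_relaxed))
    {
        for (int k = 0; k < 4096; ++k)            // no atomic access in this inner loop
            local = _mm256_add_epi32(local, one);
        ++chunks;
    }
    chunksDone += chunks; // the lanes of 'local' could be horizontally summed here as well
}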
Reproducible on Godbolt, showing -DUB_TYPEPUN vs. -UUB_TYPEPUN for a version of the source where I used #ifdef to switch between your unsafe (and missed-opt-triggering) version and a safe one with manually-vectorized shuffles from "Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2". (That manual hsum doesn't widen before adding, so it could overflow and wrap. But that's not the point; with different manual shuffles, or _mm256_store_si256 to a separate array, it would be possible to get the result you want without strict-aliasing undefined behaviour.)
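For reference, a shuffle-based 32-bit hsum in the spirit of that linked answer looks roughly like this (my own sketch; the exact code in the Godbolt link may differ, and as noted it sums in 32-bit so it can wrap):

#include <immintrin.h>
#include <cstdint>

// Sketch: manually-vectorized horizontal sum of 8 x 32-bit lanes, no pointer onto the __m256i.
static uint32_t hsum_epi32(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i s4 = _mm_add_epi32(lo, hi);                                            // 8 lanes -> 4
    __m128i s2 = _mm_add_epi32(s4, _mm_unpackhi_epi64(s4, s4));                    // 4 lanes -> 2
    __m128i s1 = _mm_add_epi32(s2, _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 3, 0, 1))); // 2 lanes -> 1
    return static_cast<uint32_t>(_mm_cvtsi128_si32(s1));
}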
The scalar loop is:
# g++9.3 -O3 -march=znver1
.L5: # do{
inc eax # local++
.L3:
mov rdx, QWORD PTR [rdi+8] # load the address of stopFlag from the lambda
movzx edx, BYTE PTR [rdx] # zero-extend *&stopFlag into EDX
test dl, dl
je .L5 # }while(stopFlag == 0)
The vector loop, with g++ 9.3 -O3 -march=znver1, using your reinterpret_cast (i.e. -DUB_TYPEPUN in my version of the source):
# g++9.3 -O3 -march=znver1 with your pointer-cast onto the vector
# ... ymm1 = _mm256_set1_epi32(1)
.L10: # do {
vpaddd ymm1, ymm0, YMMWORD PTR [rsp-32] # memory-source add with set1(1)
vmovdqa YMMWORD PTR [rsp-32], ymm1 # store back into stack memory
.L8:
mov rax, QWORD PTR [rdi+8] # load flag address
movzx eax, BYTE PTR [rax] # load stopFlag
test al, al
je .L10 # }while(stopFlag == 0)
... auto-vectorized hsum, zero-extending elements to 64-bit for vpaddq
But with a safe __m256i horizontal sum that avoids taking a pointer onto local at all, local stays in a register.
# ymm1 = _mm256_set1_epi32(1)
.L9:
vpaddd ymm0, ymm1, ymm0 # local += set1(1), staying in a register, ymm0
.L8:
mov rax, QWORD PTR [rdi+8] # same loop overhead, still 3 uops (with fusion of test/je)
movzx eax, BYTE PTR [rax]
test al, al
je .L9
... manually-vectorized 32-bit hsum
On my Intel Skylake i7-6700k, I get the expected 800% +- 1% for every number of threads, with g++ 10.1 -O3 -march=skylake, on Arch GNU/Linux, energy_performance_preference=balance_power (max clocks = 3.9GHz with any number of cores active).
The scalar and vector loops have the same number of uops and no different bottlenecks, so they run at identical cycles per iteration. (4, perhaps running at 1 iteration per cycle if it can keep those address -> value chains of stopFlag loads in flight.)
Zen1 could be different because vpaddd ymm is 2 uops. But its front-end is wide enough to probably still run that loop at 1 cycle per iteration, so you might see 800% there, too.
With ++loop uncommented, I get ~267% "SIMD speed". With an extra inc in the SIMD loop it becomes 5 uops, and probably suffers from some nasty front-end effect on Skylake.
-O0 benchmarking is meaningless in general: it has different bottlenecks (usually store/reload from keeping everything in memory), and SIMD intrinsics usually have a lot of extra overhead at -O0. Although in this case, even -O3 was bottlenecking on store/reload for the SIMD loop.