c++ - Why is std::for_each faster than __gnu_parallel::for_each

I'm trying to understand why std::for_each, which runs on a single thread, is ~3 times faster than __gnu_parallel::for_each in the example below:

Time =0.478101 milliseconds

versus

Time =0.166421 milliseconds

Here is the code I used for benchmarking:

#include <iostream>
#include <chrono>
#include <vector>
#include <parallel/algorithm>

// The struct I'm using for timing
struct   TimerAvrg
{
    std::vector<double> times;
    size_t curr=0,n;
    std::chrono::high_resolution_clock::time_point begin,end;
    TimerAvrg(int _n=30)
    {
        n=_n;
        times.reserve(n);
    }

    inline void start()
    {
        begin= std::chrono::high_resolution_clock::now();
    }

    inline void stop()
    {
        end= std::chrono::high_resolution_clock::now();
        double duration=double(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count())*1e-6;
        if ( times.size()<n)
            times.push_back(duration);
        else{
            times[curr]=duration;
            curr++;
            if (curr>=times.size()) curr=0;}
    }

    double getAvrg()
    {
        double sum=0;
        for(auto t:times)
            sum+=t;
        return sum/double(times.size());
    }
};



int main( int argc, char** argv )
{
    float sum=0;
    for(int alpha = 0; alpha <5000; alpha++)
    {
        TimerAvrg Fps;
        Fps.start();
        std::vector<float> v(1000000);
        std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
        Fps.stop();
        sum = sum + Fps.getAvrg()*1000;
    }

    std::cout << "\rTime =" << sum/5000<< " milliseconds" << std::endl;
    return 0;
}

Here is my configuration:

gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04)

Intel® Core™ i7-7600U CPU @ 2.80GHz × 4

htop to check whether the program runs on a single thread or on multiple threads

g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark

gcc 8.1.0 does not compile the same code. I get this error message:

/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
   using __gnu_cxx::conf_hypergf;

I have checked several posts, but they are either very old or not about the same issue.

My questions are:

Why is the parallel version slower?

Am I using the wrong function?

cppreference says that gcc does not support the Standardization of Parallelism TS (shown in red in the table), and yet my code runs in parallel!

Best answer

Your function [](auto v){ v=0;} is extremely simple.

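The function can be replaced with a single call to memset, or SIMD instructions can be used for single-threaded parallelism. Knowing that it overwrites the vector with the same state it already had, the optimizer could remove the entire loop. It is much easier for the optimizer to replace std::for_each than a parallel implementation.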
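As a side note on the lambda itself: it takes its argument by value, so v=0 assigns to a copy and the vector is never written to, which makes the loop even easier to discard. A minimal sketch of the difference, with std::fill standing in for the kind of replacement a compiler could make. The three function names are illustrative and do not come from the question:

```cpp
#include <algorithm>
#include <vector>

// Mirrors the question's lambda: the parameter is a copy, so nothing is written back.
inline void zero_by_value(std::vector<float>& data)
{
    std::for_each(data.begin(), data.end(), [](auto x) { x = 0; (void)x; });
}

// Takes each element by reference, so the vector really is zeroed.
inline void zero_by_reference(std::vector<float>& data)
{
    std::for_each(data.begin(), data.end(), [](auto& x) { x = 0; });
}

// Same intent stated directly; a compiler may lower this to memset or SIMD stores.
inline void zero_with_fill(std::vector<float>& data)
{
    std::fill(data.begin(), data.end(), 0.0f);
}
```

Calling zero_by_value on a vector filled with 1.0f leaves every element at 1.0f, while the other two functions set every element to 0.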

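Furthermore, assuming the parallel loop uses threads, remember that creating them and eventually synchronising them has overhead (in this case no synchronisation is needed during processing), and that overhead may be significant relative to your trivial operation.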
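To get a feel for that overhead, here is a rough sketch that times nothing but starting and joining threads that do no work. It uses plain std::thread and a hypothetical helper named spawn_join_cost_us. The libstdc++ parallel mode is built on OpenMP, whose runtime may reuse threads, so the numbers are only indicative and not directly comparable to __gnu_parallel::for_each:

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Returns the average wall-clock cost, in microseconds, of starting and
// joining `workers` threads that do no work, averaged over `rounds` repetitions.
inline double spawn_join_cost_us(unsigned workers, int rounds)
{
    assert(rounds > 0);
    using clock = std::chrono::steady_clock;
    const auto begin = clock::now();
    for (int r = 0; r < rounds; ++r) {
        std::vector<std::thread> pool;
        pool.reserve(workers);
        for (unsigned i = 0; i < workers; ++i)
            pool.emplace_back([] {});   // empty task: only the overhead remains
        for (auto& t : pool)
            t.join();                   // the final synchronisation point
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - begin;
    return elapsed.count() / rounds;
}
```

Printing spawn_join_cost_us(4, 1000) next to the 0.166421 ms measured for the whole std::for_each run gives an idea of how much of the budget thread management alone can consume on a given machine.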

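Threaded parallelism is usually only worth it for computationally expensive tasks. v=0 is one of the least computationally expensive operations there is.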
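One way to act on this is to fall back to the sequential algorithm unless there is enough work to amortise the threads. The sketch below assumes a hypothetical helper, for_each_maybe_parallel, with a caller-chosen size threshold. It is not part of libstdc++ or the parallel mode, and the right threshold depends on how expensive the per-element function is:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: applies `fn` to every element, splitting the range
// across threads only when it holds at least `min_parallel_size` elements.
template <typename T, typename Fn>
void for_each_maybe_parallel(std::vector<T>& data, Fn fn,
                             std::size_t min_parallel_size,
                             unsigned workers = 2)
{
    if (workers < 2 || data.size() < min_parallel_size) {
        std::for_each(data.begin(), data.end(), fn);   // cheap path, no threads
        return;
    }
    // Round up so that at most `workers` chunks cover the whole range.
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t first = 0; first < data.size(); first += chunk) {
        const std::size_t last = std::min(first + chunk, data.size());
        // Chunks are disjoint, so no synchronisation is needed while processing.
        pool.emplace_back([&data, fn, first, last] {
            std::for_each(data.begin() + first, data.begin() + last, fn);
        });
    }
    for (auto& t : pool)
        t.join();
}
```

For a per-element function as cheap as an assignment, the threshold would have to be very large before the threaded path pays off. For a costly function, a much smaller one may do.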