openMp: Severe performance loss when accessing shared references to dynamic arrays

This article looks at how to deal with "openMp: severe performance loss when accessing shared references to dynamic arrays". It should be a useful reference for anyone facing the same problem; read on to work through it.

Problem Description



I am writing a CFD simulation and want to parallelise my ~10^5 loop (lattice size), which is part of a member function. The implementation of the openMp code is straightforward: I read entries of shared arrays, do calculations with thread-private quantities, and finally write to a shared array again. In every array I only access the array element at the loop index, so I don't expect a race condition and I don't see any reason to flush. Testing the speedup of the code (the parallel part), I find that all but one CPU run at only ~70%. Does anybody have an idea how to improve this?

void class::funcPar(bool parallel){
#pragma omp parallel
{
    int one, two, three;
    double four, five;

    #pragma omp for
    for(int b=0; b<lenAr; b++){
        one = A[b]+B[b];
        C[b] = one;
        one += D[b];
        E[b] = one;
    }
}

}

Solution

Several points, then test code, then discussion:

  1. 10^5 isn't that much if each item is an int. The overhead incurred by launching multiple threads might be greater than the benefit.
  2. Compiler optimizations can get thrown off when using OMP.
  3. When only a few operations are performed per chunk of memory, a loop can be memory bound, i.e. the CPU spends its time waiting for the requested memory to arrive (a rough estimate follows below).
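
On point 3, here is a rough back-of-the-envelope estimate for the loop in question (my numbers, assuming 4-byte ints): each iteration reads three ints (12 bytes), writes two ints (8 bytes), and performs only two integer additions, i.e. roughly one arithmetic operation per 10 bytes of memory traffic, which puts the loop firmly in memory-bound territory on typical hardware.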

As promised, here's the code:

#include <iostream>
#include <chrono>
#include <cstdlib>    // atoi, srand
#include <ctime>      // time
#include <omp.h>      // omp_get_thread_num, omp_get_num_threads
#include <Eigen/Core>


Eigen::VectorXi A;
Eigen::VectorXi B;
Eigen::VectorXi D;
Eigen::VectorXi C;
Eigen::VectorXi E;
int size;

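// Serial baseline: the question's loop, with the OpenMP pragmas commented out.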
void regular()
{
    //#pragma omp parallel
    {
        int one;
//      #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

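// Same loop, with the iterations shared across threads by OpenMP.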
void parallel()
{
#pragma omp parallel
    {
        int one;
        #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

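// Whole-vector Eigen expressions, which compile down to SIMD code.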
void vectorized()
{
    C = A+B;
    E = C+D;
}

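// Both: split the vectors into one contiguous chunk per thread and let Eigen
// handle each chunk (the last thread picks up the remainder).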
void both()
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int vals = size / nthreads;
        int startInd = tid * vals;
        if(tid == nthreads - 1)
            vals += size - nthreads * vals;
        auto am = Eigen::Map<Eigen::VectorXi>(A.data() + startInd, vals);
        auto bm = Eigen::Map<Eigen::VectorXi>(B.data() + startInd, vals);
        auto cm = Eigen::Map<Eigen::VectorXi>(C.data() + startInd, vals);
        auto dm = Eigen::Map<Eigen::VectorXi>(D.data() + startInd, vals);
        auto em = Eigen::Map<Eigen::VectorXi>(E.data() + startInd, vals);
        cm = am+bm;
        em = cm+dm;
    }
}
int main(int argc, char* argv[])
{
    srand(time(NULL));
    size = 100000;
    int iterations = 10;
    if(argc > 1)
        size = atoi(argv[1]);
    if(argc > 2)
        iterations = atoi(argv[2]);
    std::cout << "Size: " << size << "\n";
    A = Eigen::VectorXi::Random(size);
    B = Eigen::VectorXi::Random(size);
    D = Eigen::VectorXi::Random(size);
    C = Eigen::VectorXi::Zero(size);
    E = Eigen::VectorXi::Zero(size);

    auto startReg = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        regular();
    auto endReg = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPar = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        parallel();
    auto endPar = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startVec = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        vectorized();
    auto endVec = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPVc = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        both();
    auto endPVc = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    std::cout << "Timings:\n";
    std::cout << "Regular:    " << std::chrono::duration_cast<std::chrono::microseconds>(endReg - startReg).count() / iterations << "\n";
    std::cout << "Parallel:   " << std::chrono::duration_cast<std::chrono::microseconds>(endPar - startPar).count() / iterations << "\n";
    std::cout << "Vectorized: " << std::chrono::duration_cast<std::chrono::microseconds>(endVec - startVec).count() / iterations << "\n";
    std::cout << "Both      : " << std::chrono::duration_cast<std::chrono::microseconds>(endPVc - startPVc).count() / iterations << "\n";

    return 0;
}

I used Eigen as a vector library to help prove a point regarding optimizations, which I'll get to shortly. The code was compiled in four different optimization modes, using g++ 4.8.3 (x86_64-posix-sjlj, built by the strawberryperl.com project) under Windows.
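
The exact command lines aren't given here; with that toolchain they would presumably look something like "g++ -fopenmp -I<path-to-Eigen> -O0 main.cpp" (and likewise with -O1, -O2 and -O3 for the other modes), where the Eigen include path and the source file name are placeholders.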

Discussion

We'll start by looking at 10^5 vs 10^6 elements, averaged 100 times without optimizations.

10^5 (without optimizations):

Timings:
Regular:    9300
Parallel:   2620
Vectorized: 2170
Both      : 910

10^6 (without optimizations):

Timings:
Regular:    93535
Parallel:   27191
Vectorized: 21831
Both      : 8600

Vectorization (SIMD) trumps OMP in terms of speedup. Combined, we get even better times.

Moving to -O1:

10^5:

Timings:
Regular:    780
Parallel:   300
Vectorized: 80
Both      : 80

10^6:

Timings:
Regular:    7340
Parallel:   2220
Vectorized: 1830
Both      : 1670

Same as without optimizations except that timings are much better.

Skipping ahead to -O3:

10^5:

Timings:
Regular:    380
Parallel:   130
Vectorized: 80
Both      : 70

10^6:

Timings:
Regular:    3080
Parallel:   1750
Vectorized: 1810
Both      : 1680

For 10^5, vectorization still comes out ahead. For 10^6, however, the OMP loop is slightly faster than the vectorized version.

In all the tests, we got about a x2-x4 speedup for OMP.

Note: I originally ran the tests when I had another low priority process using all the cores. For some reason, this affected mainly the parallel tests, and not the others. Make sure you time things correctly.
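
For instance, a minimal sketch of timing only the parallel loop with OpenMP's own wall-clock timer, on a fixed thread count (set OMP_NUM_THREADS in the environment before running); the array names simply mirror the benchmark above:

#include <omp.h>
#include <cstdio>
#include <vector>

// Minimal sketch: time only the parallel loop with OpenMP's wall-clock timer.
// Run on an otherwise idle machine, e.g. with OMP_NUM_THREADS=4 exported,
// so that background load cannot skew the measurement.
int main()
{
    const int n = 1000000;
    std::vector<int> A(n, 1), B(n, 2), D(n, 3), C(n), E(n);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int b = 0; b < n; b++) {
        int one = A[b] + B[b];   // thread-private temporary
        C[b] = one;
        one += D[b];
        E[b] = one;
    }
    double t1 = omp_get_wtime();

    std::printf("parallel loop: %.6f s on up to %d threads\n",
                t1 - t0, omp_get_max_threads());
    return 0;
}

Repeating the run a few times and taking the best result gives a more stable number than a single measurement.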

Conclusion

Your minimal code example does not behave as claimed. Issues such as memory access patterns can arise with more complex data. Add enough detail to accurately reproduce your problem (MCVE) to get better help.

That concludes this article on "openMp: severe performance loss when accessing shared references to dynamic arrays". We hope the answer above is helpful, and thank you for your continued support!
