openMp: Severe performance loss when accessing shared references to dynamic arrays

This article looks at how to deal with "openMp: severe performance loss when accessing shared references to dynamic arrays". It should be a useful reference for anyone facing the same problem; read on to work through it.

Problem Description



I am writing a CFD simulation and want to parallelise my ~10^5 loop (lattice size), which is part of a member function. The implementation of the openMp code is straightforward: I read entries of shared arrays, do calculations with thread-private quantities, and finally write to a shared array again. In every array I only access the array element at the loop index, so I don't expect a race condition and I don't see any reason to flush. Testing the speedup of the code (the parallel part), I find that all but one CPU run at only ~70%. Does anybody have an idea how to improve this?

void class::funcPar(bool parallel){
#pragma omp parallel
{
    int one, two, three;
    double four, five;

    #pragma omp for
    for(int b=0; b<lenAr; b++){
        one = A[b]+B[b];
        C[b] = one;
        one += D[b];
        E[b] = one;
    }
}

}

Solution

Several points, then test code, then discussion:

  1. 10^5 isn't that much if each item is an int. The overhead incurred by launching multiple threads might be greater than the benefit.
  2. Compiler optimizations can get thrown off when using OMP.
  3. When only a few operations are performed per chunk of memory, a loop can be memory bound, i.e. the CPU spends its time waiting for the requested memory to arrive (a rough estimate follows below).
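
On point 3, here is a rough back-of-the-envelope estimate for the loop in question (my numbers, assuming 4-byte ints): each iteration reads three ints (12 bytes), writes two ints (8 bytes), and performs only two integer additions, i.e. roughly one arithmetic operation per 10 bytes of memory traffic, which puts the loop firmly in memory-bound territory on typical hardware.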

As promised, here's the code:

#include <iostream>
#include <chrono>
#include <cstdlib>    // atoi, srand
#include <ctime>      // time
#include <omp.h>      // omp_get_thread_num, omp_get_num_threads
#include <Eigen/Core>


Eigen::VectorXi A;
Eigen::VectorXi B;
Eigen::VectorXi D;
Eigen::VectorXi C;
Eigen::VectorXi E;
int size;

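// Serial baseline: the question's loop, with the OpenMP pragmas commented out.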
void regular()
{
    //#pragma omp parallel
    {
        int one;
//      #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

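// Same loop, with the iterations shared across threads by OpenMP.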
void parallel()
{
#pragma omp parallel
    {
        int one;
        #pragma omp for
        for(int b=0; b<size; b++){
            one = A[b]+B[b];
            C[b] = one;
            one += D[b];
            E[b] = one;
        }
    }
}

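// Whole-vector Eigen expressions, which compile down to SIMD code.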
void vectorized()
{
    C = A+B;
    E = C+D;
}

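// Both: split the vectors into one contiguous chunk per thread and let Eigen
// handle each chunk (the last thread picks up the remainder).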
void both()
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int vals = size / nthreads;
        int startInd = tid * vals;
        if(tid == nthreads - 1)
            vals += size - nthreads * vals;
        auto am = Eigen::Map<Eigen::VectorXi>(A.data() + startInd, vals);
        auto bm = Eigen::Map<Eigen::VectorXi>(B.data() + startInd, vals);
        auto cm = Eigen::Map<Eigen::VectorXi>(C.data() + startInd, vals);
        auto dm = Eigen::Map<Eigen::VectorXi>(D.data() + startInd, vals);
        auto em = Eigen::Map<Eigen::VectorXi>(E.data() + startInd, vals);
        cm = am+bm;
        em = cm+dm;
    }
}
int main(int argc, char* argv[])
{
    srand(time(NULL));
    size = 100000;
    int iterations = 10;
    if(argc > 1)
        size = atoi(argv[1]);
    if(argc > 2)
        iterations = atoi(argv[2]);
    std::cout << "Size: " << size << "\n";
    A = Eigen::VectorXi::Random(size);
    B = Eigen::VectorXi::Random(size);
    D = Eigen::VectorXi::Random(size);
    C = Eigen::VectorXi::Zero(size);
    E = Eigen::VectorXi::Zero(size);

    auto startReg = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        regular();
    auto endReg = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPar = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        parallel();
    auto endPar = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startVec = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        vectorized();
    auto endVec = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    auto startPVc = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < iterations; i++)
        both();
    auto endPVc = std::chrono::high_resolution_clock::now();

    std::cerr << C.sum() - E.sum() << "\n";

    std::cout << "Timings:\n";
    std::cout << "Regular:    " << std::chrono::duration_cast<std::chrono::microseconds>(endReg - startReg).count() / iterations << "\n";
    std::cout << "Parallel:   " << std::chrono::duration_cast<std::chrono::microseconds>(endPar - startPar).count() / iterations << "\n";
    std::cout << "Vectorized: " << std::chrono::duration_cast<std::chrono::microseconds>(endVec - startVec).count() / iterations << "\n";
    std::cout << "Both      : " << std::chrono::duration_cast<std::chrono::microseconds>(endPVc - startPVc).count() / iterations << "\n";

    return 0;
}

I used Eigen as a vector library to help prove a point regarding optimizations, which I'll get to shortly. The code was compiled in four different optimization modes, using g++ 4.8.3 (x86_64-posix-sjlj, built by the strawberryperl.com project) under Windows.
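
The exact command lines aren't given here; with that toolchain they would presumably look something like "g++ -fopenmp -I<path-to-Eigen> -O0 main.cpp" (and likewise with -O1, -O2 and -O3 for the other modes), where the Eigen include path and the source file name are placeholders.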

Discussion

We'll start by looking at 10^5 vs 10^6 elements, averaged 100 times without optimizations.

10^5 (without optimizations):

Timings:
Regular:    9300
Parallel:   2620
Vectorized: 2170
Both      : 910

10^6 (without optimizations):

Timings:
Regular:    93535
Parallel:   27191
Vectorized: 21831
Both      : 8600

Vectorization (SIMD) trumps OMP in terms of speedup. Combined, we get even better times.

Moving to -O1:

10^5:

Timings:
Regular:    780
Parallel:   300
Vectorized: 80
Both      : 80

10^6:

Timings:
Regular:    7340
Parallel:   2220
Vectorized: 1830
Both      : 1670

Same as without optimizations except that timings are much better.

Skipping ahead to -O3:

10^5:

Timings:
Regular:    380
Parallel:   130
Vectorized: 80
Both      : 70

10^6:

Timings:
Regular:    3080
Parallel:   1750
Vectorized: 1810
Both      : 1680

For 10^5, vectorization still comes out ahead. For 10^6, however, the OMP loop is slightly faster than the vectorized version.

In all the tests, we got about a x2-x4 speedup for OMP.

Note: I originally ran the tests when I had another low priority process using all the cores. For some reason, this affected mainly the parallel tests, and not the others. Make sure you time things correctly.
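
For instance, a minimal sketch of timing only the parallel loop with OpenMP's own wall-clock timer, on a fixed thread count (set OMP_NUM_THREADS in the environment before running); the array names simply mirror the benchmark above:

#include <omp.h>
#include <cstdio>
#include <vector>

// Minimal sketch: time only the parallel loop with OpenMP's wall-clock timer.
// Run on an otherwise idle machine, e.g. with OMP_NUM_THREADS=4 exported,
// so that background load cannot skew the measurement.
int main()
{
    const int n = 1000000;
    std::vector<int> A(n, 1), B(n, 2), D(n, 3), C(n), E(n);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int b = 0; b < n; b++) {
        int one = A[b] + B[b];   // thread-private temporary
        C[b] = one;
        one += D[b];
        E[b] = one;
    }
    double t1 = omp_get_wtime();

    std::printf("parallel loop: %.6f s on up to %d threads\n",
                t1 - t0, omp_get_max_threads());
    return 0;
}

Repeating the run a few times and taking the best result gives a more stable number than a single measurement.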

Conclusion

Your minimal code example does not behave as claimed. Issues such as memory access patterns can arise with more complex data. Add enough detail to accurately reproduce your problem (MCVE) to get better help.

That concludes this article on "openMp: severe performance loss when accessing shared references to dynamic arrays". We hope the answer above is helpful, and thank you for your continued support!
