Question
I put together a sample program that mimics the type of data structure I am dealing with. Namely, I have n objects, and I need to iterate over each possible pair once and perform a (symmetric) calculation. This operation involves writing data to both elements of the pair. In serial, this would take the form of a loop like this:
for(int i = 0; i < N-1; ++i)
    for(int j = i + 1; j < N; ++j)
        ...
However, it did not take much searching on the Internet to find a "cache-oblivious parallel implementation", which I wrote up and reproduced below. I've linked here a post (which uses Intel TBB) that describes this algorithm in detail.
I tried using OpenMP tasks to perform the same thing, and it always runs slower than the serial counterpart (simply compiling without -fopenmp). I compile it with g++ -Wall -std=c++11 -O3 test.cpp -o test. The same is observed with or without -O3; the serial version is always faster.
To add a bit more information: in my real application, there are typically between a few hundred and a few thousand elements (the variable n in the example below) that I need to loop over in this pair-wise fashion many times. Millions of times. My attempt below tries to simulate that (though I only tried looping 10-100k times).
I've timed this very crudely using time ./test, simply because there's so much of a difference. Yes, I know my example is poorly written and that I am including the time required to create the vector. But serial gives me ~30 seconds and parallel over a minute, so I don't think I need to do anything more rigorous just yet.
My question is: why does the serial version do better? Have I done something incorrect in OpenMP? How do I properly correct my mistake(s)? Have I misused tasks? I have a feeling that the recursive tasking has something to do with it; I tried setting OMP_THREAD_LIMIT to 4, but it did not make a substantial difference. Is there a better way of implementing this using OpenMP?
Note: my question is specifically asking how to fix this particular implementation so that it works properly in parallel. Though if anyone knows an alternative solution to this problem and its proper implementation in OpenMP, I am open to that as well.
Thanks in advance.
#include <vector>
#include <iostream>

std::vector<std::vector<double>> testme;

// Recursively process the rectangle [i0,i1) x [j0,j1).
void rect(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0;
    int dj = j1 - j0;
    constexpr int threshold = 16;
    if(di > threshold && dj > threshold)
    {
        // Split into four quadrants; two independent ones run as a task each.
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        #pragma omp task
        { rect(i0, im, j0, jm); }
        rect(im, i1, jm, j1);
        #pragma omp taskwait
        #pragma omp task
        { rect(i0, im, jm, j1); }
        rect(im, i1, j0, jm);
        #pragma omp taskwait
    }
    else
    {
        // Leaf case: do the pairwise work serially.
        for(int i = i0; i < i1; ++i)
            for(int j = j0; j < j1; ++j) {
                testme[i][j] = 1;
                testme[j][i] = 1;
            }
    }
}

// Recursively process the triangular domain [n0,n1).
void triangle(int n0, int n1)
{
    int dn = n1 - n0;
    if(dn > 1)
    {
        int nm = n0 + dn/2;
        #pragma omp task
        { triangle(n0, nm); }
        triangle(nm, n1);
        #pragma omp taskwait
        rect(n0, nm, nm, n1);
    }
}

void calc_force(int nbodies)
{
    #pragma omp parallel num_threads(4)
    #pragma omp single
    triangle(0, nbodies);
}

int main()
{
    int n = 400;
    for(int i = 0; i < n; ++i)
    {
        std::vector<double> temp;
        for(int j = 0; j < n; ++j)
            temp.push_back(0);
        testme.push_back(temp);
    }
    // Repeat a bunch of times.
    for(int i = 0; i < 100000; ++i)
        calc_force(n);
    return 0;
}
Answer
The simple idea of using a recursive algorithm for such a workload already seems very strange to me. And then, parallelising it with OpenMP tasks seems even stranger... Why not tackle the problem with a more conventional approach?

So I decided to give it a go with a few methods that came to my mind. But to make the exercise sensible, it was important that some actual work was done for computing the "symmetric calculation"; otherwise, just iterating over all the elements without accounting for the symmetry would certainly be the best option.
So I wrote a force() function computing something loosely related to the gravitational interaction between two bodies, based on their coordinates. Then I tested 4 different methods to iterate over the particles:
- A naïve triangular approach such as the one you proposed. Due to its intrinsic load imbalance, this one is parallelised with a schedule(auto) clause to permit the run-time library to take the decision it thinks best for performance.
- A "clever" traversal of the triangular domain, consisting in cutting it in half in the j direction to allow for using 2 regular loops. Basically, it corresponds to something like this (a small sketch checking this index fold follows the list):
   /|
  / |        __ __
 /  |   =>  |  //|
/___|       |_//_|

- A naive rectangular approach, simply ignoring the symmetry and traversing the full n x n domain (the "naive rectangular loop" in the timings below).
- A linearised method in which the i and j indexes of all valid pairs are stored in two vectors beforehand, and the iteration runs over those vectors (the "linearised loop" below).
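As a quick sanity check of the fold used in method #2 (this snippet is not part of the original answer, and the value of n is arbitrary), the following standalone program verifies that the index mapping used in loop2 below visits every unordered pair exactly once:

#include <iostream>
#include <set>
#include <utility>
#include <algorithm>

// Standalone check of the folded-index mapping used in loop2 below:
// every unordered pair (a, b) with a < b must be generated exactly once.
int main() {
    const int n = 7;                          // arbitrary small example size
    std::set<std::pair<int, int>> seen;
    int m = n/2 - 1 + n%2;
    for ( int i = 0; i < n; i++ ) {
        for ( int j = 0; j < m; j++ ) {
            int ii, jj;
            if ( j < i ) { ii = n-1-i; jj = n-1-j; }   // "folded" lower part
            else         { ii = i;     jj = j+1;   }   // upper part, shifted by one
            seen.insert( { std::min(ii, jj), std::max(ii, jj) } );
        }
    }
    if ( n%2 == 0 )                           // left-over column of pairs for even n
        for ( int i = 0; i < n/2; i++ )
            seen.insert( { i, n/2 } );
    std::cout << "pairs visited: " << seen.size()
              << " (expected " << n*(n-1)/2 << ")\n";
    return 0;
}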
Since accumulating the forces in a vector with a force[i] += fij; force[j] -= fij; approach would generate race conditions for the updates of the non-parallelised index (for example j in method #1), I've created a per-thread local force array, which is initialised to 0 upon entry to the parallel region. The computations are then done per thread on this "private" array, and the individual contributions are accumulated onto the global force array with a critical construct upon exit of the parallel region. This is a typical reduction pattern for arrays in OpenMP.
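As an aside, with a compiler supporting OpenMP 4.5 or later, the same effect can be obtained with an array-section reduction clause instead of the manual per-thread arrays and critical section. The sketch below is not part of the measured code; it reuses the x, y, f globals and the force() function from the full listing and shows how the triangular loop could look:

#include <vector>
#include <cmath>

std::vector<double> x, y, f;                  // same globals as in the listing below

double force( int i, int j ) {                // same definition as in the listing below
    double dx = x[i] - x[j];
    double dy = y[i] - y[j];
    double dist2 = dx*dx + dy*dy;
    return dist2 * std::sqrt( dist2 );
}

// Triangular loop using an OpenMP 4.5 array-section reduction: the runtime
// gives each thread a private copy of fp[0:n] and combines the copies at the
// end of the loop, replacing the manual per-thread arrays and the critical
// accumulation shown above.
void loop1_reduction( int n ) {
    double *fp = f.data();
    #pragma omp parallel for schedule(auto) reduction(+: fp[:n])
    for ( int i = 0; i < n-1; i++ ) {
        for ( int j = i+1; j < n; j++ ) {
            double fij = force( i, j );
            fp[i] += fij;
            fp[j] -= fij;
        }
    }
}

Whether this is faster than the manual pattern depends on the runtime's reduction implementation, so it is worth benchmarking both.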
Here is the full code:
#include <iostream>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <omp.h>

std::vector< std::vector<double> > l_f;   // per-thread local force arrays
std::vector<double> x, y, f;              // coordinates and global forces
std::vector<int> vi, vj;                  // linearised pair indexes (method #4)
int numth, tid;
#pragma omp threadprivate( tid )

// Some actual work, loosely related to a gravitational interaction
double force( int i, int j ) {
    double dx = x[i] - x[j];
    double dy = y[i] - y[j];
    double dist2 = dx*dx + dy*dy;
    return dist2 * std::sqrt( dist2 );
}

// Method #1: naive triangular loop, load-balanced with schedule(auto)
void loop1( int n ) {
    #pragma omp parallel
    {
        for ( int i = 0; i < n; i++ ) {
            l_f[tid][i] = 0;
        }
        #pragma omp for schedule(auto) nowait
        for ( int i = 0; i < n-1; i++ ) {
            for ( int j = i+1; j < n; j++ ) {
                double fij = force( i, j );
                l_f[tid][i] += fij;
                l_f[tid][j] -= fij;
            }
        }
        #pragma omp critical
        for ( int i = 0; i < n; i++ ) {
            f[i] += l_f[tid][i];
        }
    }
}

// Method #2: triangular domain folded into a rectangle of regular loops
void loop2( int n ) {
    int m = n/2-1+n%2;
    #pragma omp parallel
    {
        for ( int i = 0; i < n; i++ ) {
            l_f[tid][i] = 0;
        }
        #pragma omp for schedule(static) nowait
        for ( int i = 0; i < n; i++ ) {
            for ( int j = 0; j < m; j++ ) {
                int ii, jj;
                if ( j < i ) {
                    ii = n-1-i;
                    jj = n-1-j;
                }
                else {
                    ii = i;
                    jj = j+1;
                }
                double fij = force( ii, jj );
                l_f[tid][ii] += fij;
                l_f[tid][jj] -= fij;
            }
        }
        if ( n%2 == 0 ) {
            // left-over column of pairs when n is even
            #pragma omp for schedule(static) nowait
            for ( int i = 0; i < n/2; i++ ) {
                double fij = force( i, n/2 );
                l_f[tid][i] += fij;
                l_f[tid][n/2] -= fij;
            }
        }
        #pragma omp critical
        for ( int i = 0; i < n; i++ ) {
            f[i] += l_f[tid][i];
        }
    }
}

// Method #3: naive rectangular loop, ignoring the symmetry
void loop3( int n ) {
    #pragma omp parallel for schedule(static)
    for ( int i = 0; i < n; i++ ) {
        for ( int j = 0; j < n; j++ ) {
            if ( i < j ) {
                f[i] += force( i, j );
            }
            else if ( i > j ) {
                f[i] -= force( i, j );
            }
        }
    }
}

// Method #4: linearised loop over the pre-computed pair indexes
void loop4( int n ) {
    #pragma omp parallel
    {
        for ( int i = 0; i < n; i++ ) {
            l_f[tid][i] = 0;
        }
        #pragma omp for schedule(static) nowait
        for ( int k = 0; k < (int) vi.size(); k++ ) {
            int i = vi[k];
            int j = vj[k];
            double fij = force( i, j );
            l_f[tid][i] += fij;
            l_f[tid][j] -= fij;
        }
        #pragma omp critical
        for ( int i = 0; i < n; i++ ) {
            f[i] += l_f[tid][i];
        }
    }
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) {
        std::cout << "need the dim as argument\n";
        return 1;
    }
    int n = std::atoi( argv[1] );

    // initialise data
    f.resize( n );
    x.resize( n );
    y.resize( n );
    for ( int i = 0; i < n; ++i ) {
        x[i] = y[i] = i;
        f[i] = 0;
    }

    // initialising linearised index vectors
    for ( int i = 0; i < n-1; i++ ) {
        for ( int j = i+1; j < n; j++ ) {
            vi.push_back( i );
            vj.push_back( j );
        }
    }

    // initialising the local forces vectors
    #pragma omp parallel
    {
        tid = omp_get_thread_num();
        #pragma omp master
        numth = omp_get_num_threads();
    }
    l_f.resize( numth );
    for ( int i = 0; i < numth; i++ ) {
        l_f[i].resize( n );
    }

    // testing all methods one after the other, with a warm up before each
    int lmax = 10000;

    loop1( n );
    double tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop1( n );
    }
    double tend = omp_get_wtime();
    std::cout << "Time for triangular loop is " << tend-tbeg << "s\n";

    loop2( n );
    tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop2( n );
    }
    tend = omp_get_wtime();
    std::cout << "Time for mangled rectangular loop is " << tend-tbeg << "s\n";

    loop3( n );
    tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop3( n );
    }
    tend = omp_get_wtime();
    std::cout << "Time for naive rectangular loop is " << tend-tbeg << "s\n";

    loop4( n );
    tbeg = omp_get_wtime();
    for ( int l = 0; l < lmax; l++ ) {
        loop4( n );
    }
    tend = omp_get_wtime();
    std::cout << "Time for linearised loop is " << tend-tbeg << "s\n";

    int ret = f[n-1];   // use the result so the computation isn't optimised away
    return ret;
}
Now it becomes simple to evaluate their relative performance and scalability. All methods are timed in a loop, after a first non-timed warm-up iteration.

Compilation: g++ -O3 -mtune=native -march=native -fopenmp tbf.cc -o tbf
Results on an 8-core IvyBridge CPU:
> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 9.21198s
Time for mangled rectangular loop is 10.1316s
Time for naive rectangular loop is 15.9408s
Time for linearised loop is 10.6449s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 6.84671s
Time for mangled rectangular loop is 5.13731s
Time for naive rectangular loop is 8.09542s
Time for linearised loop is 5.4654s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 4.03016s
Time for mangled rectangular loop is 2.90809s
Time for naive rectangular loop is 4.45373s
Time for linearised loop is 2.7733s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 2.31051s
Time for mangled rectangular loop is 2.05854s
Time for naive rectangular loop is 3.03463s
Time for linearised loop is 1.7106s
So in this case, method #4 seems the best option, with both good performance and very good scalability. Notice that the straightforward triangular approach isn't too bad, thanks to the good load-balancing job done by the schedule(auto) clause. But ultimately, I would encourage you to test with your own workload...
For reference, your initial code (modified to compute force() in the exact same way as for the other tests, including the number of OpenMP threads used, but without the need for local force arrays and a final reduction, thanks to the recursive approach) gives this:
> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.32888s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.48718s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 10.962s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 13.2786
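As an aside (this variant is not part of the comparison above, only a hedged sketch), a common mitigation for such recursive task code is to stop spawning tasks once the sub-problems become small, for instance with a final clause; the task_cutoff value below is an assumption and would need tuning:

#include <vector>

extern std::vector<std::vector<double>> testme;   // defined in the question's code

// Variant of the question's rect() that stops creating deferred tasks below a
// (hypothetical) cutoff: small sub-rectangles then run inline, which reduces
// task-creation and taskwait overhead.
constexpr int task_cutoff = 64;                   // assumed value, tune per machine

void rect(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0;
    int dj = j1 - j0;
    constexpr int threshold = 16;
    if(di > threshold && dj > threshold)
    {
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        #pragma omp task final(di <= task_cutoff) mergeable
        { rect(i0, im, j0, jm); }
        rect(im, i1, jm, j1);
        #pragma omp taskwait
        #pragma omp task final(di <= task_cutoff) mergeable
        { rect(i0, im, jm, j1); }
        rect(im, i1, j0, jm);
        #pragma omp taskwait
    }
    else
    {
        for(int i = i0; i < i1; ++i)
            for(int j = j0; j < j1; ++j) {
                testme[i][j] = 1;
                testme[j][i] = 1;
            }
    }
}

Even with such a cutoff, the frequent taskwait synchronisations limit the parallelism that can be extracted, which is consistent with the scaling measured above, so the loop-based approaches remain the safer choice for this workload.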