Problem Description
I am writing a C++ program that does a brute-force search for closed Knight's tours. The code is here.
I would like to parallelize this using OpenMP. My problem is how to do this in a way that creates a sufficient degree of parallelism. Currently the relevant portion of my code looks like this:
#pragma omp parallel for reduction(+:count) if (depth==4)
for (size_t i = 0; i < g.neighbours[last].size(); i++) {
    auto n = g.neighbours[last][i];
    // See if n can be used to extend or complete the tour
The if (depth==4) is my attempt to make sure that not too many parallel tasks are created, but that on the other hand enough are created to keep all processors busy. Setting depth==2 does not change the runtime of the program.
This does not seem to be working out. For a 3x12 problem, on my dual-core processor the total CPU time consumed by the OpenMP version is around 130s whereas a single-threaded version without OpenMP takes around 40s of CPU time.
I would appreciate suggestions on how better to use OpenMP or reasons why it is unsuitable for this problem.
UPDATE: Thanks to @Zulan I have a newer version using OpenMP tasks with a much faster sequential performance as well as good parallelization.
Recommended Answer
First, let's take a super-quick look at where your application spends its time, by using perf:
perf record ./knights_tour_omp 3 12
perf report
# Overhead Command Shared Object Symbol
# ........ ............... ................... .................................................................................................
#
53.29% knights_tour_om libc-2.19.so [.] malloc
23.16% knights_tour_om libc-2.19.so [.] _int_free
10.31% knights_tour_om libc-2.19.so [.] _int_malloc
4.78% knights_tour_om knights_tour_omp [.] _Z13continue_tourIZ4mainEUlRKSt6vectorIiSaIiEEE_EmRK5GraphS4_RKS0_IcSaIcEEiiRT_.constprop.119
2.64% knights_tour_om libc-2.19.so [.] __memmove_ssse3_back
1.48% knights_tour_om libc-2.19.so [.] malloc_consolidate
Your application spends all its time allocating and freeing memory. While there are some reports that malloc is not fully locked, it doesn't seem to parallelize nicely either.
It doesn't need much further investigation to figure out that this is caused by copying the visited and path vectors for each iteration. Fortunately, DFS has the nice property that you don't necessarily need to do that: you can reuse the state and restore it:
visited[n] = 1;            // mark n as visited
path.emplace_back(n);      // extend the current path by n
count += continue_tour(g, path, visited, depth+1, remain-1, cb);
visited[n] = 0;            // undo both changes after the recursive call,
path.pop_back();           // so the same state objects can be reused
However, for the parallel loop you need to make working copies for each thread, so you need a separate code path for this case with the original copying behavior.
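To make that concrete, here is a minimal sketch of how the two cases could be combined in one function. The signature and the Graph type below are reconstructed from the snippets and the perf output above rather than taken from the actual program, so details will differ; the point is only that copies are made on the single parallel level, while all deeper levels reuse and restore the shared state.

#include <cstddef>
#include <vector>

// Hypothetical minimal Graph, standing in for the one in the question's code.
struct Graph {
    std::vector<std::vector<int>> neighbours;
};

template <typename Callback>
size_t continue_tour(const Graph& g, std::vector<int>& path, std::vector<char>& visited,
                     int depth, int remain, const Callback& cb) {
    // Base case (closed tour found, invoking cb) omitted for brevity.
    size_t count = 0;
    int last = path.back();
    #pragma omp parallel for reduction(+:count) if (depth == 4)
    for (size_t i = 0; i < g.neighbours[last].size(); i++) {
        auto n = g.neighbours[last][i];
        if (visited[n]) continue;                  // skip squares already on the path
        if (depth == 4) {
            // Parallel level: each iteration works on its own copies.
            auto path_copy = path;
            auto visited_copy = visited;
            visited_copy[n] = 1;
            path_copy.emplace_back(n);
            count += continue_tour(g, path_copy, visited_copy, depth + 1, remain - 1, cb);
        } else {
            // Serial levels: mutate the shared state and restore it afterwards.
            visited[n] = 1;
            path.emplace_back(n);
            count += continue_tour(g, path, visited, depth + 1, remain - 1, cb);
            visited[n] = 0;
            path.pop_back();
        }
    }
    return count;
}

With this structure the heap traffic is limited to the few iterations of the parallel level, which is what removes the malloc bottleneck visible in the profile.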
This small change already brings down the serial runtime from 22 s to 2.65 s. Further, with 2 threads it goes down to 2.0 s. There is a speedup, but it's not so good - why? To illustrate that, I used 4 threads (on a large-enough system). Here is the execution of the application over time, recorded with Score-P and shown in Vampir.
Red is actual work, blue is an implicit barrier, meaning the threads are idling. There always seem to be either 3 or 1 active threads. The reason is simple: most of the positions on the board have either 4 or 2 neighbours, but one of them has already been visited, so that loop iteration completes instantly. The actual work is therefore in either 3 or 1 iterations of the loop. That is a particularly bad match for 2 threads. With 3 threads, the runtime goes down to 1.7 s.
Note: all times are with gcc 5.3 on an E5-2690 @ 2.90 GHz with no turbo. No care was taken to compensate for variance in runtimes.
Considering that a single loop exposes very bad parallelism for this problem, you might be tempted to nest parallel loops. I encourage you to try it, but I don't think that will work out well. I would think that tasks work well in this context, because tasks can spawn other tasks. So you can spawn each recursive step as a task and let the OpenMP runtime figure out a good scheduling. But make sure to limit it to a certain depth, so that tasks do not get too short.
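As a rough illustration of that idea (not the actual updated code from the question), the recursive step could spawn a task per unvisited neighbour up to a cutoff depth, with each task working on its own copies, and fall back to the serial reuse-and-restore variant below the cutoff. The name task_depth_limit and the data-sharing clauses here are assumptions made for the sketch:

// Sketch of a task-based continue_tour body; task_depth_limit is a hypothetical tuning knob.
size_t count = 0;
int last = path.back();
if (depth < task_depth_limit) {
    for (size_t i = 0; i < g.neighbours[last].size(); i++) {
        auto n = g.neighbours[last][i];
        if (visited[n]) continue;
        #pragma omp task default(shared) firstprivate(n)
        {
            auto path_copy = path;            // per-task working copies
            auto visited_copy = visited;
            visited_copy[n] = 1;
            path_copy.emplace_back(n);
            size_t c = continue_tour(g, path_copy, visited_copy, depth + 1, remain - 1, cb);
            #pragma omp atomic
            count += c;
        }
    }
    #pragma omp taskwait   // make sure all child tasks have finished before count is used
} else {
    // below the cutoff: the serial reuse-and-restore code shown earlier
}
return count;

Tasks also need an enclosing parallel region, typically created once at the top level, for example:

size_t total = 0;
#pragma omp parallel
#pragma omp single
total = continue_tour(g, path, visited, 1, remain, cb);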
I want to stress the importance of using tools: trying to figure out performance issues without performance analysis tools is like figuring out bugs without a debugger. perf is readily available on Linux and in its basic form extremely simple to use. Yet it is very powerful (although advanced usage can have some pitfalls).
An additional remark: with OpenMP it is often worthwhile to try different compilers. For instance, with the (fixed) tasking code, gcc no longer scales beyond 4 threads, while the Intel compiler / OpenMP runtime provides a speedup even up to 24 threads.
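For example (illustrative command lines, not from the original post; the source file name is assumed), the same code can be built with both toolchains and the thread count controlled via OMP_NUM_THREADS:

g++  -O3 -fopenmp knights_tour.cpp -o knights_tour_omp
icpc -O3 -qopenmp knights_tour.cpp -o knights_tour_omp
OMP_NUM_THREADS=4 ./knights_tour_omp 3 12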