Problem description
I am relying on OpenMP parallelization and pseudo-random number generation in my program, but at the same time I would like the results to be perfectly reproducible if desired (given the same number of threads).
I'm seeding a thread_local PRNG for each thread separately, like this:
{
    std::minstd_rand master{};
    #pragma omp parallel for ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
        #pragma omp ordered
        global::tl_rng.seed(master());
}
and I've come up with the following way of producing count elements of some kind and putting them all into an array at the end in a deterministic order (results of thread 0 first, then those of thread 1, and so on):
std::vector<Element> all{};
...
#pragma omp parallel if(parallel)
{
    std::vector<Element> tmp{};
    tmp.reserve(count/omp_get_num_threads() + 1);
    // generation loop
    #pragma omp for
    for(size_t j = 0; j < count; j++)
        tmp.push_back(generateElement(global::tl_rng));
    // collection loop
    #pragma omp for ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
        #pragma omp ordered
        all.insert(all.end(),
                   std::make_move_iterator(tmp.begin()),
                   std::make_move_iterator(tmp.end()));
}
The question
This seems to work, but I'm not sure whether it is reliable (read: portable). Specifically, if, for example, the second thread finishes its share of the main loop early because its generateElement() calls happened to return quickly, won't it technically be allowed to pick the first iteration of the collection loop? With my compiler that does not happen: it is always thread 0 doing j = 0, thread 1 doing j = 1, etc., as intended. Does that follow from the standard, or is it allowed to be compiler-specific behaviour?
I could not find much about the ordered clause of the for directive, except that it is required if the loop contains an ordered directive. Does it always guarantee that the threads split the loop from the start in order of increasing thread_num? Where does it say so in citable sources? Or do I have to make my "generation" loop ordered as well? Does that actually make a difference (performance- or logic-wise) when there is no ordered directive in it?
Please don't answer from experience, or from how OpenMP would logically be implemented. I'd like the answer to be backed by the standard.
Recommended answer
No, the code in its current state is not portable. It will work only if the default loop schedule is static, that is, if the iteration space is divided into contiguous chunks of roughly count / #threads iterations each, which are then assigned to the threads in order of their thread ID, with a guaranteed mapping between chunk and thread ID. But the OpenMP specification does not prescribe a default schedule and leaves it to the implementation to pick one. Many implementations use static, but that is not guaranteed to always be the case.
If you add schedule(static) to all loop constructs, then the combination of the ordered clause and the ordered construct within the loop body will ensure that thread 0 receives the first chunk of iterations and is also the first one to execute the ordered construct. For the loops that run over the number of threads, the chunk size will be one, i.e. each thread will execute exactly one iteration, and the order of the iterations of the parallel loop will match that of a sequential loop. The 1:1 mapping of iteration number to thread ID done by the static schedule will then ensure the behaviour you are aiming for.
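To make the suggestion concrete, here is a minimal sketch of the seeding, generation and collection loops with explicit static schedules. Element, generateElement, tl_rng (standing in for global::tl_rng) and generateAll are hypothetical stand-ins, not the asker's real definitions. I have also written schedule(static, 1) on the two loops over the team size, which is my own choice: with an explicit chunk size the chunks are handed out round-robin in thread-number order, so iteration j should land on thread j. The #else branch is only there so the sketch still builds serially when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallback (assumption): lets the sketch build without OpenMP.
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical stand-ins for the question's global::tl_rng and Element.
thread_local std::minstd_rand tl_rng;
using Element = unsigned;

// Hypothetical generator: draws one value from the calling thread's PRNG.
Element generateElement(std::minstd_rand& rng) {
    return static_cast<Element>(rng());
}

std::vector<Element> generateAll(std::size_t count, int threads) {
    assert(threads > 0);
    std::vector<Element> all;
    all.reserve(count);
    std::minstd_rand master{};  // default-seeded, as in the question

    #pragma omp parallel num_threads(threads)
    {
        const int team = omp_get_num_threads();

        // Seeding loop: one iteration per thread, draws from master
        // happen in iteration order because of the ordered construct.
        #pragma omp for ordered schedule(static, 1)
        for (int j = 0; j < team; j++) {
            #pragma omp ordered
            tl_rng.seed(master());
        }

        std::vector<Element> tmp;
        tmp.reserve(count / team + 1);

        // Generation loop: static schedule, one contiguous chunk per thread.
        #pragma omp for schedule(static)
        for (std::size_t j = 0; j < count; j++)
            tmp.push_back(generateElement(tl_rng));

        // Collection loop: thread 0 appends first, then thread 1, and so on.
        #pragma omp for ordered schedule(static, 1)
        for (int j = 0; j < team; j++) {
            #pragma omp ordered
            all.insert(all.end(), tmp.begin(), tmp.end());
        }
    }
    return all;
}
```

Calling generateAll(1000, 4) twice should give two identical vectors as long as both calls really run with the same team size.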
Note that if the first loop, where you initialise the thread-local PRNGs, is in a different parallel region, you must ensure that both parallel regions execute with the same number of threads, e.g. by disabling dynamic team sizing (omp_set_dynamic(0);) or by explicitly applying the num_threads clause.
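A small sketch of that precaution, assuming a hypothetical helper teamSizes that is not part of the question: it disables dynamic adjustment, requests the same team size for two separate parallel regions, and reports what each region actually got. Even with these settings an implementation may deliver fewer threads than requested when resources are limited, so comparing the two values before relying on per-thread PRNG state is cheap insurance.

```cpp
#include <cassert>
#include <utility>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption): let the sketch build without OpenMP.
static void omp_set_dynamic(int) {}
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical helper: returns the team sizes observed in two
// consecutive parallel regions that request the same thread count.
std::pair<int, int> teamSizes(int requested) {
    assert(requested > 0);
    omp_set_dynamic(0);  // do not let the runtime shrink teams on its own

    int first = 0;
    int second = 0;

    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single
        first = omp_get_num_threads();
    }

    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single
        second = omp_get_num_threads();
    }

    return {first, second};
}
```

If the two numbers differ, the seeding region and the generation region did not run with the same team, and the results cannot be expected to match a previous run.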
As to the significance of the ordered clause + construct: it does not influence the assignment of iterations to threads, but it synchronises the threads and makes sure that the physical execution order matches the logical one. A statically scheduled loop without an ordered clause will still assign iteration 0 to thread 0, but there is no guarantee that some other thread won't execute its loop body ahead of thread 0. Also, any code in the loop body outside of the ordered construct is still allowed to execute concurrently and out of order - see here for a more detailed explanation.
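The distinction can be seen in a short sketch; orderedTrace and the per-iteration computation are hypothetical and only serve as illustration. The work before the ordered construct may run on several threads at once and in any order, while the push_back inside it is executed strictly in iteration order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: records values in the order in which the
// ordered regions execute.
std::vector<int> orderedTrace(int n) {
    assert(n >= 0);
    std::vector<int> trace;
    trace.reserve(static_cast<std::vector<int>::size_type>(n));

    #pragma omp parallel for ordered schedule(static, 1)
    for (int j = 0; j < n; j++) {
        // Outside the ordered construct: may run concurrently and
        // out of order across threads.
        int local = j * j;  // hypothetical per-iteration work

        // Inside the ordered construct: runs in iteration order,
        // one thread at a time.
        #pragma omp ordered
        trace.push_back(local);
    }
    return trace;
}
```

The returned vector holds the squares in increasing order of j regardless of which thread computed which value. Without the ordered clause and construct, the unsynchronised push_back calls would be a data race.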