Problem Description
CRAY supercomputer using the MPICH2 library. Each node has 32 CPUs.
I have a single float on each of N different MPI ranks, where each of these ranks is on a different node. I need to perform a reduction operation over this group of floats. I would like to know whether MPI_Reduce is faster than MPI_Gather with the reduction computed on the root, for any value of N. Please assume that the reduction done on the root rank uses a good parallel reduction algorithm that can utilize N threads.
If it isn't faster for every value of N, does it tend to be faster for smaller N, like 16, or for larger N?
If it is true, why? (For example, does MPI_Reduce use a tree communication pattern that tends to hide the reduction operation's time inside the communication with the next level of the tree?)
Recommended Answer
Assume that MPI_Reduce is always faster than MPI_Gather + local reduce.
Even if there were a value of N for which the reduction is slower than the gather, an MPI implementation could easily implement the reduction for that case in terms of gather + local reduce.
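That fallback can be sketched in a few lines. This is a pure-Python illustration, not real MPI: the `gathered_values` list stands in for the receive buffer MPI_Gather would fill on the root, and `reduce_via_gather` is a hypothetical name for the fallback, not an actual MPI routine.

```python
import functools
import operator

def reduce_via_gather(gathered_values, op=operator.add):
    """Hypothetical fallback: if a tree reduction were ever slower for
    some N, the implementation could gather all contributions to the
    root (here: the gathered_values list) and fold them locally."""
    return functools.reduce(op, gathered_values)

# One float per rank, as in the question (16 ranks each contributing 0.5):
print(reduce_via_gather([0.5] * 16))  # 8.0
```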
MPI_Reduce has only advantages over MPI_Gather + local reduce:

- MPI_Reduce is the more high-level operation, giving the implementation more opportunity to optimize.
- MPI_Reduce needs to allocate much less memory.
- MPI_Reduce needs to communicate less data (if using a tree) or less data over the same link (if using direct all-to-one).
- MPI_Reduce can distribute the computation across more resources (e.g. using a tree communication pattern).
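The tree and memory points can be made concrete with a small simulation (assumed 4-byte elements, not a real MPI measurement): a binomial-tree reduction finishes in ceil(log2 N) rounds and the root receives only one partial result per round, whereas a direct all-to-one gather funnels N-1 messages into the root's single link and must stage an N-element receive buffer there.

```python
import math

def tree_reduce(values):
    """Simulate a binomial-tree sum reduction over len(values) ranks.
    Each round halves the number of active ranks; rank 0 (the root)
    receives exactly one partial sum per round."""
    vals = list(values)
    n = len(vals)
    rounds = 0
    step = 1
    while step < n:
        for dst in range(0, n, 2 * step):
            src = dst + step
            if src < n:
                vals[dst] += vals[src]  # partial sums merge as they climb the tree
        step *= 2
        rounds += 1
    return vals[0], rounds

def gather_costs(n, elem_bytes=4):
    """Direct all-to-one gather: n-1 messages arrive over the root's
    single link, and the root stages an n-element receive buffer."""
    return (n - 1), n * elem_bytes

result, rounds = tree_reduce(range(16))
print(result, rounds)    # 120 4  -> only log2(16) = 4 messages reach the root
print(gather_costs(16))  # (15, 64) -> 15 messages, 64-byte root buffer
```

For N = 16 the difference is small; for thousands of ranks the gap between log2 N rounds and N-1 serialized arrivals at the root is what makes the tree pattern win.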
That said: never assume anything about performance. Measure.