This article looks at an mpi4py problem where idle cores cause a significant slowdown, and at how to deal with it; it may be a useful reference for anyone hitting the same issue.

Problem description

I have a Python script that uses MPI for parallel calculations. The computation scheme is: data processing round 1, data exchange between the processes, data processing round 2. The machine has 16 logical cores (2 x Intel Xeon E5520 @ 2.27 GHz). For reasons specific to my problem, round 1 cannot be run in parallel, so 15 cores stay idle. Despite that, the calculations slow down by more than a factor of two.

The problem is illustrated by the following script (saved as test.py):

from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
comm.barrier()
stime = time.time()

if rank == 0:
    # Rank 0 does all of the round-1 work; the other ranks just wait
    # for the broadcast.
    print('begin calculations at {:.3f}'.format(time.time() - stime))
    for i in range(1000000000):
        a = 2 * 2
    print('end calculations at {:.3f}'.format(time.time() - stime))
    comm.bcast(a, root = 0)
    print('end data exchange at {:.3f}'.format(time.time() - stime))
else:
    a = comm.bcast(root = 0)

When I run it on 2 cores, I observe:

$ mpiexec -n 2 python3 test.py
begin calculations at 0.000
end calculations at 86.954
end data exchange at 86.954

When I run it on 16 cores, I observe:

$ mpiexec -n 16 python3 test.py
begin calculations at 0.000
end calculations at 174.156
end data exchange at 174.157

Can anyone explain this difference? An idea of how to get rid of it would also be useful.

Recommended answer

OK, I finally figured it out.

There are several factors contributing to the slowdown:

  • Waiting to receive data is active: the process constantly polls to check whether the data has arrived, so the "waiting" processes are not actually idle.
  • Intel's virtual (hyper-threaded) cores do not contribute to calculation speed. An 8-core machine is still an 8-core machine and behaves as such, regardless of the virtual cores (in some cases, for example with multithreading, they can give a modest boost, but not with MPI). See the launch-command note right after this list.
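
A practical consequence of the second point (my own note, not part of the original answer): with 16 ranks, the 8 physical cores are oversubscribed by busy-polling processes. Even without modifying the script, launching no more ranks than there are physical cores should already reduce the slowdown considerably, assuming the OS spreads the ranks across physical cores:

$ mpiexec -n 8 python3 test.py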

Taking this into account, I modified the code so that the waiting processes sleep between checks. The results are shown on the chart (10 measurements were taken for each case).

from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
comm.barrier()
stime = time.time()

if rank == 0:
    # Rank 0 still does all of the round-1 work, then sends the result
    # to every other rank.
    for i in range(1000000000):
        a = 2 * 2
    print('end calculations at {:.3f}'.format(time.time() - stime))
    for i in range(1, size):
        comm.send(a, dest=i)
    print('end data exchange at {:.3f}'.format(time.time() - stime))
else:
    # Instead of blocking in recv (which busy-polls), check once per second
    # whether a message has arrived and sleep in between, freeing the core.
    while not comm.Iprobe(source=0):
        time.sleep(1)
    a = comm.recv(source=0)
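
For completeness, here is a variation of the same idea (my own sketch, not part of the original answer): the waiting ranks post a non-blocking irecv and poll its completion once per second with test(), which likewise keeps them off the CPU while they wait. The timing prints are omitted for brevity; the rank 0 side is otherwise unchanged.

from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Same serial round-1 work and point-to-point sends as above.
    for i in range(1000000000):
        a = 2 * 2
    for i in range(1, size):
        comm.send(a, dest=i)
else:
    # Post a non-blocking receive and poll it once per second,
    # sleeping in between so the core is actually released.
    req = comm.irecv(source=0)
    done, a = req.test()
    while not done:
        time.sleep(1)
        done, a = req.test()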

