Problem description
I have a Python script that uses MPI for parallel calculations. The scheme of the calculation is as follows: data processing round 1 - data exchange between processes - data processing round 2. I have a machine with 16 logical cores (2 x Intel Xeon E5520 2.27GHz). For some reason, round 1 cannot be run in parallel, so 15 cores stay idle. Despite this, the calculations experience a more than 2-fold slowdown.
The problem is illustrated by this script (saved as test.py):
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

comm.barrier()
stime = time.time()

if rank == 0:
    print('begin calculations at {:.3f}'.format(time.time() - stime))
    for i in range(1000000000):
        a = 2 * 2
    print('end calculations at {:.3f}'.format(time.time() - stime))
    comm.bcast(a, root = 0)
    print('end data exchange at {:.3f}'.format(time.time() - stime))
else:
    a = comm.bcast(root = 0)
When I run it on 2 cores, I observe:
$ mpiexec -n 2 python3 test.py
begin calculations at 0.000
end calculations at 86.954
end data exchange at 86.954
When I run it on 16 cores, I observe:
$ mpiexec -n 16 python3 test.py
begin calculations at 0.000
end calculations at 174.156
end data exchange at 174.157
Can anyone explain this difference? An idea for how to get rid of it would also be helpful.
Answer
OK, I finally figured it out.
There are several factors contributing to the slowdown:
- Waiting to receive data is active (the process constantly checks whether the data has already arrived), so the waiting processes are not actually idle.
- Intel virtual (hyper-threaded) cores do not contribute to calculation speed. That means an 8-core machine is still an 8-core machine and behaves like one, regardless of the virtual cores (in some cases, for example with multithreading, they can give a modest boost, but not with MPI). A quick way to check the physical core count is sketched right after this list.
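As a side note, the physical vs. logical core count can be confirmed from Python; this is a minimal sketch assuming the third-party psutil package is installed (it is not used anywhere else in these scripts):

import psutil

# On this machine: 16 logical cores, but only 8 physical ones.
print(psutil.cpu_count(logical=True))   # logical (hyper-threaded) cores
print(psutil.cpu_count(logical=False))  # physical cores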
Taking this into account, I modified the code, introducing the sleep() function into the waiting processes. The results are shown on the chart (10 measurements were taken in each case).
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

comm.barrier()
stime = time.time()

if rank == 0:
    for i in range(1000000000):
        a = 2 * 2
    print('end calculations at {:.3f}'.format(time.time() - stime))
    for i in range(1, size):
        comm.send(a, dest = i)
    print('end data exchange at {:.3f}'.format(time.time() - stime))
else:
    # sleep between polls instead of busy-waiting inside recv()
    while not comm.Iprobe(source = 0):
        time.sleep(1)
    a = comm.recv(source = 0)
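For reuse, the waiting pattern above can be factored into a small helper; sleepy_recv is a hypothetical name, and the one-second poll interval simply mirrors the value used above:

from mpi4py import MPI
import time

def sleepy_recv(comm, source, interval=1.0):
    # Poll for an incoming message at a fixed interval instead of
    # busy-waiting, so the waiting core stays mostly idle.
    while not comm.Iprobe(source=source):
        time.sleep(interval)
    return comm.recv(source=source)

# On the non-root ranks the else branch then becomes:
# a = sleepy_recv(MPI.COMM_WORLD, source=0)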