I cannot run Open MPI under Slurm through a Slurm script

Normally, I can get the hostname and run Open MPI on my machine:

$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works

However, when I do the same thing through a Slurm script, mpirun hostname returns an empty string, so I cannot run mpirun -n 1 bin/ua.B.x inputua.data.

slurm-script.sh:
#!/bin/bash
#SBATCH -o slurm.out        # STDOUT
#SBATCH -e slurm.err        # STDERR
#SBATCH --mail-type=ALL

export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/
make ua CLASS=B
mpirun --host myHost -n 1 bin/ua.B.x inputua.data

$ sbatch -N1 slurm-script.sh
Submitted batch job 1

The error I get:
There are no allocated resources for the application
  bin/ua.B.x
that match the requested mapping:
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.

A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------

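Best answer

If Slurm and OpenMPI are recent versions, make sure OpenMPI was compiled with Slurm support (run ompi_info | grep slurm to check), and then simply run srun bin/ua.B.x inputua.data in your submission script.

Alternatively, mpirun bin/ua.B.x inputua.data should also work.

If OpenMPI was compiled without Slurm support, the following should work: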
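The check described above can be sketched as a small shell helper. This is only an illustration: the function name choose_launcher and the two labels it prints are hypothetical, and it assumes that a Slurm-aware build lists components whose names contain "slurm" in the ompi_info output, which is what the grep suggested above relies on.

```shell
# Hypothetical helper: decide how to launch based on ompi_info output.
# $1: the text printed by `ompi_info` (passed in so the function can be
#     exercised without Open MPI installed).
choose_launcher() {
    if printf '%s\n' "$1" | grep -qi 'slurm'; then
        # Slurm components were found, so srun can start the program directly.
        echo "srun"
    else
        # No Slurm components found: fall back to mpirun with an explicit hostfile.
        echo "mpirun-hostfile"
    fi
}

# Possible use inside a job script (assumes ompi_info is on PATH):
#   launcher=$(choose_launcher "$(ompi_info 2>/dev/null)")
```

In a job script, the result could then select between srun bin/ua.B.x inputua.data and the hostfile-based mpirun invocation.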

srun hostname > output.txt
cd NPB3.3-SER/
make ua CLASS=B
mpirun --hostfile output.txt -n 1 bin/ua.B.x inputua.data
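Since srun hostname prints one line per launched task, the resulting file can contain the same node name several times. As an optional variation, and not something the answer requires, the duplicates could be collapsed into explicit slot counts using Open MPI's "host slots=N" hostfile syntax. The function name make_hostfile is hypothetical.

```shell
# Hypothetical helper: read hostnames on stdin (one per line, possibly
# repeated) and print one "host slots=N" line per distinct host.
make_hostfile() {
    # sort groups identical names, uniq -c counts them,
    # awk reorders "count host" into "host slots=count".
    sort | uniq -c | awk '{ print $2 " slots=" $1 }'
}

# Possible use inside a job script (assumes srun is available):
#   srun hostname | make_hostfile > hostfile.txt
#   mpirun --hostfile hostfile.txt -n 1 bin/ua.B.x inputua.data
```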

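Also make sure that running export LD_LIBRARY_PATH="/usr/lib/openmpi/lib" does not overwrite other necessary library paths. A better option is probably export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/openmpi/lib" (or a more complex version if you want to avoid a leading : when the variable is initially empty).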
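One way to write the "more complex version" is with the shell's ${var:+...} expansion, which only emits the separator when the existing value is non-empty. This is a minimal sketch: the helper name append_path is hypothetical, and the directory is the one used in the question. Avoiding the leading colon matters because the dynamic loader treats an empty entry in LD_LIBRARY_PATH as the current directory.

```shell
# Hypothetical helper: append a directory to a colon-separated path value
# without producing a leading ':' when the existing value is empty.
# $1: existing value (may be empty), $2: directory to append
append_path() {
    # ${1:+$1:} expands to "$1:" only if $1 is set and non-empty.
    printf '%s\n' "${1:+$1:}$2"
}

# Keep whatever was already in LD_LIBRARY_PATH and add the Open MPI libraries.
LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" /usr/lib/openmpi/lib)
export LD_LIBRARY_PATH
```

The same effect can be had inline with export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib/openmpi/lib".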

Regarding openmpi - How to run Open MPI under Slurm, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55255697/
