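I am learning OpenMPI on a cluster. This is my first example. I expected the output to show responses from different nodes, but they all come from the same node, node062. Why is that, and how can I get reports from different nodes to show that MPI is actually distributing processes across them? Thanks and regards!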

ex1.c

/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
char idstr[2232]; char buff[22128];
char processor_name[MPI_MAX_PROCESSOR_NAME];
int numprocs; int myid; int i; int namelen;
MPI_Status stat;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name, &namelen);

if(myid == 0)
{
  printf("WE have %d processors\n", numprocs);
  for(i=1;i<numprocs;i++)
  {
    sprintf(buff, "Hello %d", i);
    MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
  }
  for(i=1;i<numprocs;i++)
  {
    MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
    printf("%s\n", buff);
  }
}
else
{
  MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
  sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
  strcat(buff, idstr);
  strcat(buff, "reporting for duty\n");
  MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}
MPI_Finalize();

}

ex1.pbs
#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Grid Environment for your job:
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque

# export OMP_NUM_THREADS=4

 mpirun -np 10 /home/tim/courses/MPI/examples/ex1

Compile and run:
[tim@user1 examples]$ mpicc ./ex1.c -o ex1
[tim@user1 examples]$ qsub ex1.pbs
35540.mgt
[tim@user1 examples]$ nano ex1.o35540
----------------------------------------
Begin PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883
Job ID:         35540.mgt
Username:       tim
Group:          Brown
Nodes:          node062 node063 node169 node170 node171 node172 node174 node175
node176 node177
End PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883
----------------------------------------
WE have 10 processors
Hello 1 Processor 1 at node node062 reporting for duty
Hello 2 Processor 2 at node node062 reporting for duty
Hello 3 Processor 3 at node node062 reporting for duty
Hello 4 Processor 4 at node node062 reporting for duty
Hello 5 Processor 5 at node node062 reporting for duty
Hello 6 Processor 6 at node node062 reporting for duty
Hello 7 Processor 7 at node node062 reporting for duty
Hello 8 Processor 8 at node node062 reporting for duty
Hello 9 Processor 9 at node node062 reporting for duty

----------------------------------------
Begin PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891
Job ID:         35540.mgt
Username:       tim
Group:          Brown
Job Name:       ex1
Session:        15533
Limits:         neednodes=10:ppn=1,nodes=10:ppn=1,walltime=01:10:00
Resources:      cput=00:00:00,mem=420kb,vmem=8216kb,walltime=00:00:03
Queue:          dque
Account:
Nodes:  node062 node063 node169 node170 node171 node172 node174 node175 node176
node177
Killing leftovers...

End PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891
----------------------------------------

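Update:

I want to run several background jobs in one PBS script so that they run concurrently. For example, in the example above I added a second invocation of ex1 and put both runs in the background in ex1.pbs: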
#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Grid Environment for your job:
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque

echo "The first job starts!"
mpirun -np 5 --machinefile /home/tim/courses/MPI/examples/machinefile /home/tim/courses/MPI/examples/ex1 &
echo "The first job ends!"
echo "The second job starts!"
mpirun -np 5 --machinefile /home/tim/courses/MPI/examples/machinefile /home/tim/courses/MPI/examples/ex1 &
echo "The second job ends!"

(1) Submitting this script with qsub, using the previously compiled executable ex1, gives good results.
The first job starts!
The first job ends!
The second job starts!
The second job ends!
WE have 5 processors
WE have 5 processors
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 1 Processor 1 at node node063 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty

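(2) However, I suspect ex1 runs so quickly that the two background jobs barely overlap in time, which is not the case when I apply the same approach to my real project. So I added sleep(30) to ex1.c to lengthen its running time, so that the two background ex1 jobs run concurrently almost the whole time.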
/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
char idstr[2232]; char buff[22128];
char processor_name[MPI_MAX_PROCESSOR_NAME];
int numprocs; int myid; int i; int namelen;
MPI_Status stat;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name, &namelen);

if(myid == 0)
{
  printf("WE have %d processors\n", numprocs);
  for(i=1;i<numprocs;i++)
  {
    sprintf(buff, "Hello %d", i);
    MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
  }
  for(i=1;i<numprocs;i++)
  {
    MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
    printf("%s\n", buff);
  }
}
else
{
  MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
  sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
  strcat(buff, idstr);
  strcat(buff, "reporting for duty\n");
  MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}

sleep(30); // new added to extend the running time
MPI_Finalize();

}

But after recompiling and resubmitting with qsub, the results do not look right: some processes were aborted.
In ex1.o35571:
The first job starts!
The first job ends!
The second job starts!
The second job ends!
WE have 5 processors
WE have 5 processors
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
Hello 1 Processor 1 at node node063 reporting for duty
Hello 2 Processor 2 at node node169 reporting for duty
Hello 3 Processor 3 at node node170 reporting for duty
Hello 4 Processor 4 at node node171 reporting for duty
4 additional processes aborted (not shown)
4 additional processes aborted (not shown)

In ex1.e35571:
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25376 on node node062 exited on signal 15 (Terminated).
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25377 on node node062 exited on signal 15 (Terminated).

Why were processes aborted? How do I correctly run background jobs in a PBS script submitted with qsub?

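Best answer

A few things:
You need to tell MPI where to launch the processes.
Assuming you are using mpich, look at the mpiexec help section and find the machine file or equivalent description. Unless a machine file is provided, it will run on a single host.

PBS automatically creates the node file. Its name is stored in the PBS_NODEFILE environment variable, which is available in the PBS command file. Try the following: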

mpiexec -machinefile $PBS_NODEFILE ...
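
To make the role of the node file concrete, here is a small dry-run sketch of what a PBS script could do with it. It only prints the mpiexec command rather than launching anything. The stand-in node file (used when PBS_NODEFILE is unset, i.e. outside a PBS job), the variable names, and the ./ex1 path are assumptions for illustration, not part of the original post.

```shell
#!/bin/sh
# Dry-run sketch: inspect the PBS node file and build the mpiexec command line.
# Inside a PBS job, PBS_NODEFILE points at a file with one line per allocated slot.
nodefile="${PBS_NODEFILE:-}"
made_tmp=0
if [ -z "$nodefile" ]; then
  # Assumption: outside PBS we fake a node file so the sketch still runs.
  nodefile=$(mktemp)
  made_tmp=1
  printf 'node062\nnode063\nnode169\n' > "$nodefile"
fi

# One line per slot; unique lines give the number of distinct hosts.
nslots=$(wc -l < "$nodefile" | tr -d ' ')
nhosts=$(sort -u "$nodefile" | wc -l | tr -d ' ')
echo "slots=$nslots hosts=$nhosts"

# The command a real job script would run (./ex1 is a placeholder path).
cmd="mpiexec -machinefile $nodefile -np $nslots ./ex1"
echo "would run: $cmd"

# Remove the stand-in file if we created one.
if [ "$made_tmp" -eq 1 ]; then rm -f "$nodefile"; fi
```

In a real job script the last echo would be replaced by the command itself. Deriving the process count from the node file keeps it in step with the nodes=...:ppn=... request.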

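If you are using mpich2, you have to use mpdboot to start your MPI runtime. I don't remember the details of the command; you will have to read the man page. Remember to create the secret file, otherwise mpdboot will fail.

I read your post again: you are using Open MPI, so you still need to supply the machine file to the mpiexec command, but you don't have to mess with mpdboot.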
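
On the aborted processes in the update: one likely explanation, consistent with the signal 15 messages and the "Killing leftovers..." line in the PBS epilogue shown earlier, is that the script reaches its end right after starting both mpirun commands in the background. The job is then considered finished and anything still running is terminated. The usual shell remedy is to call wait before the script exits. The sketch below shows the pattern; run_job is a hypothetical stand-in for the two mpirun lines so that it runs anywhere.

```shell
#!/bin/sh
# Stand-in for an mpirun invocation; sleep imitates a job that takes a while.
run_job() {
  sleep 1
  echo "job $1 finished"
}

echo "The first job starts!"
run_job 1 &
pid1=$!          # remember the PID of the first background job

echo "The second job starts!"
run_job 2 &
pid2=$!          # remember the PID of the second background job

# Block until each background job exits and record its exit status.
# Without these waits the script would end while the jobs are still running.
wait "$pid1"; status1=$?
wait "$pid2"; status2=$?

echo "Both jobs ended (exit codes $status1 $status2)"
```

In the PBS script this would mean keeping the two mpirun ... & lines and adding wait after them. It also means the "job ends" messages are only meaningful when printed after the wait.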

Regarding cluster-computing - testing MPI on a cluster, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2170347/
