Problem Description
I am testing MPI I/O.
subroutine save_vtk
  integer :: filetype, fh, unit
  integer(MPI_OFFSET_KIND) :: pos
  real(RP), allocatable :: buffer(:,:,:)
  integer :: ie

  if (master) then
    open(newunit=unit, file="out.vtk", &
         access='stream', status='replace', form="unformatted", action="write")
    ! write the header
    close(unit)
  end if

  call MPI_Barrier(mpi_comm, ie)
  call MPI_File_open(mpi_comm, "out.vtk", MPI_MODE_APPEND + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ie)

  ! file view: this rank's block (extents nxyz at offset off) within the global grid ng
  call MPI_Type_create_subarray(3, int(ng), int(nxyz), int(off), &
                                MPI_ORDER_FORTRAN, MPI_RP, filetype, ie)
  call MPI_Type_commit(filetype, ie)

  call MPI_Barrier(mpi_comm, ie)
  call MPI_File_get_position(fh, pos, ie)
  call MPI_Barrier(mpi_comm, ie)
  call MPI_File_set_view(fh, pos, MPI_RP, filetype, "native", MPI_INFO_NULL, ie)

  buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

  ! collective write of the local block
  call MPI_File_write_all(fh, buffer, nx*ny*nz, MPI_RP, MPI_STATUS_IGNORE, ie)
  call MPI_File_close(fh, ie)
end subroutine
The undefined variables come from host association; some error checking has been removed. I receive this error when running it on a national academic cluster:
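For orientation, here is a minimal self-contained sketch of how these host-associated quantities are assumed to fit together: ng holds the global grid extents, nxyz the local block extents, and off the 0-based offset of the local block within the global grid. The slab decomposition, file name and numeric values below are illustrative guesses, not taken from the original program.

program subarray_write_sketch
  use mpi
  implicit none
  integer, parameter :: RP = kind(1.0d0)
  integer :: mpi_rp                      ! MPI datatype matching real(RP)
  integer :: ie, nproc, myrank, fh, filetype
  integer :: ng(3), nxyz(3), off(3)
  integer(MPI_OFFSET_KIND) :: disp
  real(RP), allocatable :: phi(:,:,:)

  call MPI_Init(ie)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ie)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ie)
  mpi_rp = MPI_DOUBLE_PRECISION

  ng   = [16, 16, 8*nproc]   ! global grid extents
  nxyz = [16, 16, 8]         ! local block extents (equal slabs along z)
  off  = [0, 0, 8*myrank]    ! 0-based start of this rank's block in the global grid

  allocate(phi(nxyz(1), nxyz(2), nxyz(3)))
  phi = real(myrank, RP)

  ! each rank describes its block of the global array as a file type
  call MPI_Type_create_subarray(3, ng, nxyz, off, MPI_ORDER_FORTRAN, mpi_rp, filetype, ie)
  call MPI_Type_commit(filetype, ie)

  call MPI_File_open(MPI_COMM_WORLD, "sketch.dat", MPI_MODE_CREATE + MPI_MODE_WRONLY, &
                     MPI_INFO_NULL, fh, ie)
  disp = 0_MPI_OFFSET_KIND
  call MPI_File_set_view(fh, disp, mpi_rp, filetype, "native", MPI_INFO_NULL, ie)
  call MPI_File_write_all(fh, phi, product(nxyz), mpi_rp, MPI_STATUS_IGNORE, ie)
  call MPI_File_close(fh, ie)

  call MPI_Type_free(filetype, ie)
  call MPI_Finalize(ie)
end program subarray_write_sketch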
*** An error occurred in MPI_Isend
*** reported by process [3941400577,18036219417246826496]
*** on communicator MPI COMMUNICATOR 20 DUP FROM 0
*** MPI_ERR_BUFFER: invalid buffer pointer
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
The error is triggered by the call to MPI_File_write_all. I suspect it may be connected with the size of the buffer, which is the full nx*ny*nz, on the order of 10^5 to 10^6 elements, but I cannot exclude a programming error on my side, as I have no prior experience with MPI I/O.

The MPI implementation used is OpenMPI 1.8.0 with Intel Fortran 14.0.2.

Do you know how to make it work and write the file?
--- Edit2 ---
Testing a simplified version, the important code remains the same; the full source is here. Notice that it works with gfortran and fails with different MPIs with Intel. I wasn't able to compile it with PGI. Also, I was wrong in that it fails only across different nodes; it fails even in a single-process run.
>module ad gcc-4.8.1
>module ad openmpi-1.8.0-gcc
>mpif90 save.f90
>./a.out
 Trying to decompose in 1 1 1 process grid.
>mpirun a.out
 Trying to decompose in 1 1 2 process grid.
>module rm openmpi-1.8.0-gcc
>module ad openmpi-1.8.0-intel
>mpif90 save.f90
>./a.out
 Trying to decompose in 1 1 1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error
>module rm openmpi-1.8.0-intel
>module ad openmpi-1.6-intel
>mpif90 save.f90
>./a.out
 Trying to decompose in 1 1 1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error
[luna24.fzu.cz:24260] *** An error occurred in MPI_File_set_errhandler
[luna24.fzu.cz:24260] *** on a NULL communicator
[luna24.fzu.cz:24260] *** Unknown error
[luna24.fzu.cz:24260] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

Reason:     After MPI_FINALIZE was invoked
Local host: luna24.fzu.cz
PID:        24260
--------------------------------------------------------------------------
>module rm openmpi-1.6-intel
>module ad mpich2-intel
>mpif90 save.f90
>./a.out
 Trying to decompose in 1 1 1 process grid.
 ERROR write_all
 Other I/O error , error stack:
 ADIOI_NFS_WRITECONTIG(70): Other I/O error Bad address
Solution

In the line

buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

the array buffer should be allocated automatically to the shape of the right-hand side, according to the Fortran 2003 standard (not in Fortran 95). Intel Fortran as of version 14 does not do this by default; it requires the option -assume realloc_lhs to do that. This option is included (among other options) in the option -standard-semantics.

Because this option was not in effect when the code in the question was tested, the program accessed an unallocated array, and the resulting undefined behavior led to the crash.
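If changing compiler options is not desirable, a workaround is to allocate buffer explicitly before the assignment, so the code no longer relies on automatic reallocation on assignment. A minimal sketch, assuming nx, ny, nz are the local extents used in the question's code:

! explicit allocation avoids relying on Fortran 2003 (re)allocation on assignment
if (allocated(buffer)) deallocate(buffer)
allocate(buffer(nx, ny, nz))
buffer = BigEnd(Phi(1:nx, 1:ny, 1:nz))

Alternatively, passing the flag through the compiler wrapper, e.g. mpif90 -assume realloc_lhs save.f90 (or the broader -standard-semantics), gives the standard Fortran 2003 behavior with Intel Fortran 14.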