Problem description
Despite having written long, heavily parallelized codes with complicated send/receives over three-dimensional arrays, this simple code with a two-dimensional array of integers has me at my wits' end. I combed Stack Overflow for possible solutions and found one question that slightly resembles the issue I am having:
Boost.MPI: What's received isn't what was sent!
However, the solutions there seem to point to the looping segment of the code as the culprit for overwriting sections of memory. This case seems to act even stranger. Maybe it is a careless oversight of some simple detail on my part. The problem is with the code below:
program main
    implicit none
    include 'mpif.h'

    integer :: i, j
    integer :: counter, offset
    integer :: rank, ierr, stVal
    integer, dimension(10, 10) :: passMat, prntMat !! passMat CONTAINS VALUES TO BE PASSED TO prntMat

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    counter = 0
    offset = (rank + 1)*300
    do j = 1, 10
        do i = 1, 10
            prntMat(i, j) = 10 !! prntMat OF BOTH RANKS CONTAIN 10
            passMat(i, j) = offset + counter !! passMat OF rank=0 CONTAINS 300..399 AND rank=1 CONTAINS 600..699
            counter = counter + 1
        end do
    end do

    if (rank == 1) then
        call MPI_SEND(passMat(1:10, 1:10), 100, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr) !! SEND passMat OF rank=1 to rank=0
    else
        call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
        do i = 1, 10
            print *, prntMat(:, i)
        end do
    end if

    call MPI_FINALIZE(ierr)
end program main
When I compile the code with mpif90 with no flags and run it on my machine with mpirun -np 2, I get the following output with wrong values in the first four indices of the array:
  0   0 400   0 604 605 606 607 608 609
610 611 612 613 614 615 616 617 618 619
620 621 622 623 624 625 626 627 628 629
630 631 632 633 634 635 636 637 638 639
640 641 642 643 644 645 646 647 648 649
650 651 652 653 654 655 656 657 658 659
660 661 662 663 664 665 666 667 668 669
670 671 672 673 674 675 676 677 678 679
680 681 682 683 684 685 686 687 688 689
690 691 692 693 694 695 696 697 698 699
However, when I compile it with the same compiler but with the -O3 flag on, I get the correct output:
600 601 602 603 604 605 606 607 608 609
610 611 612 613 614 615 616 617 618 619
620 621 622 623 624 625 626 627 628 629
630 631 632 633 634 635 636 637 638 639
640 641 642 643 644 645 646 647 648 649
650 651 652 653 654 655 656 657 658 659
660 661 662 663 664 665 666 667 668 669
670 671 672 673 674 675 676 677 678 679
680 681 682 683 684 685 686 687 688 689
690 691 692 693 694 695 696 697 698 699
This error is machine dependent: it turns up only on my system running Ubuntu 14.04.2 with OpenMPI 1.6.5.
I tried this on other systems running RedHat and CentOS, and the code ran well both with and without the -O3 flag. Curiously, those machines use an older version of OpenMPI, 1.4.
I am guessing that the -O3 flag is performing some odd optimization that is modifying the manner in which arrays are being passed between the processes.
I also tried other versions of array allocation. The above code uses explicit-shape arrays. With assumed-shape and allocatable arrays I get equally bizarre results, if not more so, with some runs seg-faulting. I tried using Valgrind to trace the origin of these seg-faults, but I still haven't gotten the hang of getting Valgrind to not give false positives when running MPI programs.
I believe that resolving the difference in behavior of the above code will help me understand the tantrums of my other codes as well.
Any help would be greatly appreciated! This code has really gotten me questioning if all the other MPI codes I wrote are sound at all.
Using the Fortran 90 interface to MPI reveals a mismatch in your call to MPI_RECV:
call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
1
Error: There is no specific subroutine for the generic ‘mpi_recv’ at (1)
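This is because the status variable stVal is an integer scalar, rather than an integer array of size MPI_STATUS_SIZE. The F77 interface (include 'mpif.h') to MPI_RECV declares its status argument as INTEGER STATUS(MPI_STATUS_SIZE), but mpif.h provides no explicit interface, so the compiler cannot check the call. With a scalar, MPI_RECV writes past the end of stVal and can overwrite neighbouring variables, which would explain the corrupted leading entries of prntMat.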
Changing
integer :: rank, ierr, stVal
to
integer :: rank, ierr, stVal(mpi_status_size)
produces a program that works as expected, tested with gfortran 5.1 and OpenMPI 1.8.5.
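To make the role of the status array concrete, here is a minimal sketch that is not part of the original answer. Each rank sends one integer to itself with MPI_SENDRECV and then reads the standard fields of the status array. The program name status_demo, the tag value 7, and the variable names sendVal, recvVal and nrecv are illustrative assumptions; MPI_STATUS_SIZE, MPI_SOURCE, MPI_TAG and MPI_GET_COUNT are standard MPI names.

```fortran
program status_demo
    use mpi
    implicit none
    integer :: rank, ierr, nrecv
    integer :: sendVal, recvVal
    integer :: stVal(MPI_STATUS_SIZE)   ! status must be an array of this size

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    sendVal = 100 + rank
    recvVal = -1

    ! Send one integer to ourselves; MPI_SENDRECV avoids a blocking self-send.
    call MPI_SENDRECV(sendVal, 1, MPI_INTEGER, rank, 7, &
                      recvVal, 1, MPI_INTEGER, rank, 7, &
                      MPI_COMM_WORLD, stVal, ierr)

    ! The library fills the entries of stVal; query them by their named indices.
    call MPI_GET_COUNT(stVal, MPI_INTEGER, nrecv, ierr)
    print *, 'rank', rank, 'received', recvVal, &
             'source', stVal(MPI_SOURCE), 'tag', stVal(MPI_TAG), 'count', nrecv

    call MPI_FINALIZE(ierr)
end program status_demo
```

Because the library writes several integers into the status argument, a scalar there leaves no room for them. If the status is never inspected, passing MPI_STATUS_IGNORE in its place is another option.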
Using the F90 interface (use mpi vs include "mpif.h") lets the compiler detect the mismatched arguments at compile time rather than producing confusing runtime problems.
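Putting both suggestions together, the program from the question might look like the sketch below. It switches to use mpi and declares stVal as an array of size MPI_STATUS_SIZE. The nprocs check and the explicit rank == 0 branch are additions of mine, assumed here only so that the program does not hang when started with a process count other than 2; everything else follows the original code.

```fortran
program main
    use mpi                              ! F90 interface instead of include 'mpif.h'
    implicit none
    integer :: i, j
    integer :: counter, offset
    integer :: rank, nprocs, ierr        ! nprocs is an added, illustrative variable
    integer :: stVal(MPI_STATUS_SIZE)    ! status is an array, not a scalar
    integer, dimension(10, 10) :: passMat, prntMat

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    ! Fill prntMat with 10 and passMat with offset, offset+1, ...
    counter = 0
    offset = (rank + 1)*300
    do j = 1, 10
        do i = 1, 10
            prntMat(i, j) = 10
            passMat(i, j) = offset + counter
            counter = counter + 1
        end do
    end do

    if (nprocs < 2) then
        ! Added guard: the exchange below needs ranks 0 and 1.
        if (rank == 0) print *, 'run with at least 2 processes'
    else if (rank == 1) then
        call MPI_SEND(passMat(1:10, 1:10), 100, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
    else if (rank == 0) then
        call MPI_RECV(prntMat(1:10, 1:10), 100, MPI_INTEGER, 1, 1, MPI_COMM_WORLD, stVal, ierr)
        do i = 1, 10
            print *, prntMat(:, i)
        end do
    end if

    call MPI_FINALIZE(ierr)
end program main
```

Compiled with mpif90 and run with mpirun -np 2, rank 0 should print the values 600 through 699 regardless of the optimization level. If stVal were changed back to a scalar, the use mpi version should be rejected at compile time with the "no specific subroutine" error quoted above.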