I am trying to run a code on vortex simulations in parallel using OpenMP. These are similar to particle simulations where at each time step, the position of a vortex at the next time step has to be computed from its velocity which is determined by the positions of all the other vortices at the current time step. The vortices are deleted once they leave the domain. I compare the number of vortices at each time step for the parallel version of code with the serial version of code, and run each version multiple times.
For the serial versions, the vortex counts match exactly at every time step. For the parallel case, all runs match the serial case for a few tens of time steps, after which each parallel run shows a difference but stays within a 7-10% error bound of the serial case (as can be seen in the result linked below). I know this may be because of round-off errors in the parallel case, owing to the different order of the computational steps when the work is distributed among threads, but should the error really be as high as 10%?
I have only used the reduction clause in a parallel do construct. The only parallel region in the whole code is within a function vblob(), which is inside a module that I call from the main code. The only function calls within vblob() are to ixi() and fxi(), which are defined outside this module.
function vblob(blobs,xj,gj)
  ! del, pi, j, ixi() and fxi() are defined outside this function
  ! (ixi() and fxi() sit outside the module, as noted above).
  complex(8), intent(in) :: blobs(:,:), xj
  complex(8) :: delxi, delxic, di, gvic, xi
  real(8), intent(in) :: gj
  real(8) :: vblob(2)
  integer :: p

  gvic = 0.0; delxi = 0.0; delxic = 0.0; di = 0.0; xi = 0.0

  !$omp parallel do private(xi,delxic,delxi,di) shared(xj) reduction(+:gvic)
  do p = 1, size(blobs,1)
     xi = ixi(blobs(p,1))
     delxic = xj - conjg(xi)
     delxi = xj - xi
     di = del*fxi(xi)
     gvic = gvic + real(blobs(p,2))*1/delxic
     if (abs(delxi) .gt. 1.E-4) then
        gvic = gvic + (-1)*real(blobs(p,2))*1/delxi
     end if
  end do
  !$omp end parallel do

  gvic = j*gvic*fxi(xj)/(2*pi)
  vblob(1) = real(gvic)
  vblob(2) = -imag(gvic)
end function vblob
If the way I have constructed the parallel code is wrong, then errors should show up within the first few time steps themselves, right?
(As can be seen in this result, 'blobs' and 'sheets' are just types of vortex elements and the blue line is the total number of elements. P and S stand for parallel and serial respectively, R stands for run, the solid plot markers are the serial code, and the hollow ones are the three runs of the parallel code.)
EDIT: When I change the numerical precision of my variables to real(4) instead, the divergence in the results happens at an earlier time step than in the real(8) case above. So it is clearly a round-off error issue.
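As an illustration (not part of the original post), the precision effect can be reproduced in isolation. The sketch below, assuming the common kind numbering where real(4) and real(8) are single and double precision, accumulates the same running sum in both kinds; the single-precision total drifts away from the exact value much sooner and much further.

program precision_drift
  implicit none
  integer :: i
  real(4) :: s4
  real(8) :: s8

  s4 = 0.0_4
  s8 = 0.0_8
  do i = 1, 10000000
     s4 = s4 + 0.1_4   ! accumulate in single precision
     s8 = s8 + 0.1_8   ! accumulate in double precision
  end do

  ! The exact sum is 1.0e6; the real(4) total is visibly off,
  ! the real(8) total only in the trailing digits.
  print *, 's4 =', s4
  print *, 's8 =', s8
end program precision_drift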
TLDR: I want to check with anyone else who has seen such a result over a range of time steps, where the parallel code matches the serial code for the first few time steps and then diverges.
Your code essentially sums up a lot of terms in gvic. Floating-point arithmetic is not associative, that is, (a+b)+c is not the same as a+(b+c) due to rounding. Also, depending on the values and the signs of the terms, there may be a serious loss of precision in each operation. See here for a really mandatory read on the subject.
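A minimal illustration of the non-associativity (my own sketch, not from the linked reference):

program not_associative
  implicit none
  real(8) :: a, b, c

  a =  1.0d16
  b = -1.0d16
  c =  1.0d0

  ! The two groupings of the same sum round differently:
  ! (a + b) cancels exactly, so the first result is 1.0;
  ! c is lost to rounding when added to b, so the second result is 0.0.
  print *, '(a+b)+c =', (a + b) + c
  print *, 'a+(b+c) =', a + (b + c)
end program not_associative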
While the sequential loop computes (given no clever compiler optimisations):

gvic = (...((((g_1 + g_2) + g_3) + g_4) + g_5) + ...)

where g_i is the value added to gvic by iteration i, the parallel version computes:

gvic = t_0 + t_1 + t_2 + ... + t_(#threads-1)

where t_i is the accumulated private value of gvic in thread i (threads in OpenMP are numbered from 0, even in Fortran). The order in which the different t_i's are reduced is unspecified; the OpenMP implementation is free to choose whatever it deems fine. Even if all t_i's are summed in order, the result will still differ from the one computed by the sequential loop. Unstable numerical algorithms are exceptionally prone to producing different results when parallelised.
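The difference is easy to demonstrate with a toy reduction. The sketch below is mine, not from the original code (compile with OpenMP enabled, e.g. gfortran -fopenmp): it sums the same terms once in strict sequential order and once with an OpenMP reduction(+:...); the two totals typically agree only in the leading digits, and the parallel total can change with the thread count.

program reduction_order
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real(8) :: x(n), s_seq, s_par

  ! Terms of alternating sign and decaying magnitude make the
  ! rounding differences between summation orders visible.
  do i = 1, n
     x(i) = (-1.0d0)**i / real(i, 8)
  end do

  ! Sequential order: (((x_1 + x_2) + x_3) + ...)
  s_seq = 0.0d0
  do i = 1, n
     s_seq = s_seq + x(i)
  end do

  ! Parallel order: each thread accumulates a private partial sum t_i,
  ! and the t_i are then combined in an unspecified order.
  s_par = 0.0d0
  !$omp parallel do reduction(+:s_par)
  do i = 1, n
     s_par = s_par + x(i)
  end do
  !$omp end parallel do

  print *, 'sequential:', s_seq
  print *, 'parallel:  ', s_par
  print *, 'difference:', s_seq - s_par
end program reduction_order

Running it with OMP_NUM_THREADS=1 makes the two totals agree again, which is a quick way to confirm that a discrepancy comes from the summation order rather than from a race condition.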
This is something you can hardly avoid completely, but instead learn to control or simply live with its consequences. In many cases, the numerical solution to a problem is an approximation anyway. You should focus on conserved or statistical properties. For example, an ergodic molecular dynamics simulation may produce a completely different phase trajectory in parallel, but values such as the total energy or the thermodynamic averages will be pretty close (unless there is some serious algorithmic error or really bad numerical instability).
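For instance (a hypothetical check, not from the answer), rather than requiring serial and parallel runs to produce identical vortex counts at every step, one could compare an aggregate or conserved quantity from the two runs against a relative tolerance:

program compare_runs
  implicit none
  real(8), parameter :: rel_tol = 1.0d-6
  real(8) :: total_serial, total_parallel, rel_err

  ! Placeholder values; in practice these would be an aggregate quantity
  ! (e.g. total circulation) accumulated by the serial and parallel runs.
  total_serial   = 1.234567890123d0
  total_parallel = 1.234567890456d0

  rel_err = abs(total_parallel - total_serial) / max(abs(total_serial), tiny(1.0d0))
  if (rel_err <= rel_tol) then
     print *, 'runs agree within tolerance, rel_err =', rel_err
  else
     print *, 'runs differ beyond tolerance, rel_err =', rel_err
  end if
end program compare_runs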
A side note - you are actually lucky to enter this field now, when most CPUs use standard 32- and 64-bit floating-point arithmetic. Years ago, when x87 was a thing, floating-point operations were done with 80-bit internal precision and the end result would depend on how many times a value leaves and re-enters the FPU registers.