Bug description
For post-processing, calling restart repeatedly in a loop like this will eventually exhaust the limit on MPI communicators.
      character*80 fname

      fname = 't.fld '
      do i=1,1048576
         ! Copy the file name into initc and broadcast it.
         call blank(initc,132)
         call chcopy(initc,fname,80)
  998    call bcast(initc,80)
  127    format(a127)
         if (nio.eq.0) write(*,*) 'open file test',i,fname
         call flush()
         nfiles = 1
         call restart(nfiles) ! Note -- time is reset.
      enddo
      call exitt0
The run either hangs indefinitely (possibly because of growing memory usage and waiting on RAM/swap) or prints an error like the following (on a machine using MPICH 4.0) around the 1935th file. The limit varies between machines and can be as low as 2048.
MPIR_Get_contextid_sparse_group(591): Too many communicators (0/2048 free on this process; ignore_id=0)
Abort(134825487) on node 7 (rank 7 in comm 0): Fatal error in internal_Comm_dup: Other MPI error, error stack:
internal_Comm_dup(85)...............: MPI_Comm_dup(comm=0x84000001, newcomm=0x7ffef189ab38) failed
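This issue seems to have been introduced by the recent MPI restart change in commit 636d0b5. The cause appears to be call mpi_comm_dup(nekcomm,commrs,ierr) at Nek5000/core/ic.f, line 2408 in 4ae2620: every call to restart duplicates the communicator again, so the number of communicators grows with each file.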
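The mechanism can be isolated from Nek5000 with a small standalone program. The following is only a sketch under stated assumptions: the program name dupleak, the loop count nmax and the leak switch are hypothetical, and MPI_COMM_WORLD stands in for nekcomm. With leak set to .true. the duplicates are never freed, which mirrors what the restart loop does. Whether the loop fails, and at which call, depends on the MPI implementation.

```fortran
      program dupleak
      ! Hypothetical standalone test, not part of Nek5000.
      implicit none
      include 'mpif.h'
      integer ierr, i, rank, newcomm, nmax
      logical leak
      parameter (nmax = 5000)

      ! .true.  = never free the duplicate (mirrors the bug)
      ! .false. = free each duplicate right after creating it
      leak = .true.

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! Return error codes instead of aborting on failure.
      call mpi_comm_set_errhandler(MPI_COMM_WORLD,
     &                             MPI_ERRORS_RETURN, ierr)

      do i = 1, nmax
         call mpi_comm_dup(MPI_COMM_WORLD, newcomm, ierr)
         if (ierr .ne. MPI_SUCCESS) then
            if (rank .eq. 0) write(*,*) 'dup failed at call', i
            goto 100
         endif
         if (.not. leak) call mpi_comm_free(newcomm, ierr)
      enddo
      if (rank .eq. 0) write(*,*) 'completed', nmax, 'calls'

  100 continue
      call mpi_finalize(ierr)
      end
```

Assuming the MPI installation provides the usual compiler wrapper and launcher, this can be built and run with something like mpif90 dupleak.f followed by mpirun -np 2 ./a.out. Setting leak to .false. shows that releasing each duplicate with mpi_comm_free is another way to keep the number of live communicators bounded.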
Workaround:
Solution 1: Revert to an older version, for example f0d0420.
Solution 2: Keep the duplicated commrs for restart only, and call mpi_comm_dup only once.
On my laptop, the loop then completes all 1048576 calls without failing.
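      save    icalld
      data    icalld /0/
      ...
      ! Duplicate nekcomm on the first call only. commrs is reused
      ! afterwards, so it must persist between calls (saved or in
      ! a common block).
      if (icalld.eq.0) then
         call mpi_comm_dup(nekcomm,commrs,ierr)
      endif
      icalld = 1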
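To check the call-once guard outside Nek5000, the pattern can be wrapped in a small helper and called in a loop. This is a sketch, not the Nek5000 implementation: the names get_commrs, commsv and duponce are hypothetical, and MPI_COMM_WORLD stands in for nekcomm. The saved handle is what keeps the communicator alive between calls.

```fortran
      subroutine get_commrs(commrs)
      ! Hypothetical helper: returns a duplicate of MPI_COMM_WORLD
      ! that is created on the first call and reused afterwards.
      implicit none
      include 'mpif.h'
      integer commrs, ierr
      integer commsv, icalld
      save    commsv, icalld
      data    icalld /0/

      if (icalld .eq. 0) then
         call mpi_comm_dup(MPI_COMM_WORLD, commsv, ierr)
         icalld = 1
      endif
      commrs = commsv
      return
      end

      program duponce
      implicit none
      include 'mpif.h'
      integer ierr, i, rank, commrs, first

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      first = MPI_COMM_NULL
      do i = 1, 1048576
         call get_commrs(commrs)
         if (i .eq. 1) first = commrs
         ! Every call should hand back the same handle.
         if (commrs .ne. first) then
            call mpi_abort(MPI_COMM_WORLD, 1, ierr)
         endif
      enddo
      if (rank .eq. 0) write(*,*) 'one communicator reused'
      call mpi_finalize(ierr)
      end
```

Because mpi_comm_dup runs only once, the number of live communicators stays constant no matter how many times the helper is called.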
Extra
Thakur, Rajeev, et al. "MPI at Exascale." Proceedings of SciDAC 2 (2010): 14-35.
https://www.researchgate.net/publication/260402166_MPI_at_Exascale