Multiple restarts creates too many communicators #836

Open
yslan opened this issue Aug 5, 2024 · 1 comment

yslan commented Aug 5, 2024

Bug description

For post-processing, calling restart multiple times in a loop like this will eventually hit the limit on the number of MPI communicators.

      character*80 fname
      
      fname = 't.fld '
      do i=1,1048576
        call blank(initc,132)
        call chcopy (initc,fname,80)
  998   call bcast(initc,80)
  127   format(a127)
      
        if (nio.eq.0) write(*,*)'open file test',i,fname
        call flush()
        nfiles = 1
        call restart(nfiles)  ! Note -- time is reset.
      enddo
      call exitt0

It either hangs forever (possibly due to increased memory usage and waiting for RAM/swap) or prints an error like the one below (on a machine using MPICH 4.0) at around the 1935th file. The limit varies between machines and can be as low as 2048.

MPIR_Get_contextid_sparse_group(591): Too many communicators (0/2048 free on this process; ignore_id=0)
Abort(134825487) on node 7 (rank 7 in comm 0): Fatal error in internal_Comm_dup: Other MPI error, error stack:
internal_Comm_dup(85)...............: MPI_Comm_dup(comm=0x84000001, newcomm=0x7ffef189ab38) failed

This issue seems to have been introduced by the recent MPI restart changes in commit 636d0b5. The cause is the call mpi_comm_dup(nekcomm,commrs,ierr), which creates a new communicator on every call to restart, at this line:

Nek5000/core/ic.f

Line 2408 in 4ae2620

call mpi_comm_dup(nekcomm,commrs,ierr)
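
The leak can be illustrated outside Nek5000 with a small standalone sketch. It assumes an MPI installation that provides the mpif.h interface; the program name dupleak and the cap of 4096 attempts are hypothetical choices, not taken from Nek5000. On an MPI build with a small communicator pool, such as the MPICH case above, it would be expected to abort with a "Too many communicators" error before the loop finishes; on other implementations it may simply run to completion.

```fortran
      program dupleak
      implicit none
      include 'mpif.h'
      integer ntry
      parameter (ntry = 4096)
      integer ierr, rank, i, newcomm

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      do i = 1, ntry
c        Each duplicate takes a context id that is never returned,
c        because newcomm is overwritten without mpi_comm_free.
         call mpi_comm_dup(MPI_COMM_WORLD, newcomm, ierr)
         if (rank.eq.0 .and. mod(i,256).eq.0) write(*,*) 'dups:', i
      enddo
      call mpi_finalize(ierr)
      end
```

Built with an MPI compiler wrapper (for example mpif77 or mpifort, depending on the installation) and run on a few ranks, the last printed count gives a rough idea of the per-process limit on that machine.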

Workarounds:

  • Solution 1: Revert to an older version, e.g. f0d0420.
  • Solution 2: Keep the duplicated commrs for restart only, and call mpi_comm_dup just once.
    On my laptop, the loop above then completes all 1048576 restart calls without failing.
        data icalld/0/
        save icalld
        ...
    
        if (icalld.eq.0) then
           call mpi_comm_dup(nekcomm,commrs,ierr)
        endif
        icalld = 1
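
Another option, sketched here under the assumption that commrs is not needed once restart returns, is to release the duplicate with mpi_comm_free at the end of each call. The program dupfree and the subroutine fakerestart are hypothetical stand-ins for the restart path, and this is not necessarily how the problem was addressed upstream.

```fortran
      program dupfree
      implicit none
      include 'mpif.h'
      integer ierr, rank, i, ncall
      parameter (ncall = 10000)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
c     Far more calls than a 2048-communicator limit would allow
c     if the duplicates were never freed.
      do i = 1, ncall
         call fakerestart(MPI_COMM_WORLD)
      enddo
      if (rank.eq.0) write(*,*) 'completed calls: ', ncall
      call mpi_finalize(ierr)
      end

c     Hypothetical stand-in for restart: duplicate, use, release.
      subroutine fakerestart(comm)
      implicit none
      include 'mpif.h'
      integer comm, commrs, ierr

      call mpi_comm_dup(comm, commrs, ierr)
c     Collective I/O on commrs would go here; a barrier stands in.
      call mpi_barrier(commrs, ierr)
c     Return the communicator so its context id can be reused.
      call mpi_comm_free(commrs, ierr)
      end
```

Compared with the icalld guard, this pays the cost of a collective mpi_comm_dup on every call, but it does not rely on saved state surviving between calls.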
    



stgeke commented Aug 6, 2024

Fixed in 4099a7b
