MPI collective I/O and UnifyFS #781

Open
@adammoody

Description

With the collective write calls in MPI I/O, the MPI library may rearrange data among processes to write to the underlying file more efficiently, as is done in ROMIO's collective buffering. The user does not know which process actually writes to the file, even if they know which process provides the source data and file offset to be written.

An application may be written such that a given process writes twice to the same file offset using collective write calls. Since the same process writes to the same offset, the MPI standard does not require the application to call MPI_File_sync() between those writes. However, depending on the MPI implementation, those actual writes may happen from two different processes.

As an example taken from PnetCDF, it is common to set default values for variables in a file using fill calls and then later write actual data to those variables. The fill calls use collective I/O, whereas the later write call may not. In this case, two different processes can write to the same file offset, one process with the fill value, and a second process with the actual data. In UnifyFS, these two writes need to be separated with a sync-barrier-sync to establish an order between them.
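A minimal sketch of that pattern (not taken from PnetCDF itself; the file name, offset, and values are hypothetical) showing where the sync-barrier-sync would have to go under UnifyFS: a collective fill write that ROMIO may aggregate onto a different rank, followed by an independent overwrite of the same offset.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Offset offset = 0;               /* hypothetical variable offset */
    double fill = -9999.0, value = 42.0; /* hypothetical fill and data values */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "/unifyfs/data.nc",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective fill: with collective buffering, the byte at 'offset' may
     * actually be written by a rank other than the one providing the data. */
    MPI_File_write_at_all(fh, offset, &fill, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* Sync-barrier-sync to establish an order between the two writes
     * under UnifyFS before the same offset is overwritten. */
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    /* Later, independent write of the actual data to the same offset. */
    if (rank == 0) {
        MPI_File_write_at(fh, offset, &value, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```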

It may be necessary to ask users to do at least one of the following:

  • set UNIFYFS_CLIENT_WRITE_SYNC=1 if using collective write calls (one might still need a barrier after all syncs)
  • call MPI_File_sync() + MPI_Barrier() after any collective write call
  • disable ROMIO's collective buffering feature (see the sketch after this list)
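For the last option, a sketch of disabling collective buffering for writes through an MPI_Info hint at open time. The "romio_cb_write" hint is ROMIO-specific; other MPI-IO implementations may ignore it, and the file path here is hypothetical.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* ROMIO-specific hint to turn off collective buffering on writes. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "disable");

    MPI_File_open(MPI_COMM_WORLD, "/unifyfs/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);

    /* ... collective or independent writes ... */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```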

Need to review the MPI standard:

  1. I don't recall off the top of my head what the standard says about MPI_File_sync in the case that the application knowingly writes to the same file offset from two different ranks using two collective write calls. Is MPI_File_sync needed in between or not?
  2. I'm pretty sure that MPI_File_sync is not required when the same process writes to the same offset in two different write calls.

Regardless, I suspect very few applications currently call MPI_File_sync in either situation. Even if the standard requires it, we need to call this out.

The UnifyFS-enabled ROMIO could sync extents and then call barrier on its collective write calls. This would ensure all writes are visible upon returning from the collective write.
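An assumption of what that could look like, approximated at the application level rather than inside the ROMIO driver itself (the helper name is hypothetical): the collective write is immediately followed by a sync of the write extents and a barrier, so every process sees the data when the call returns.

```c
#include <mpi.h>

/* Hypothetical wrapper emulating a collective write that syncs extents and
 * barriers before returning, as a UnifyFS-enabled ROMIO might do internally. */
static int write_at_all_unifyfs(MPI_File fh, MPI_Comm comm, MPI_Offset offset,
                                const void *buf, int count, MPI_Datatype type)
{
    int rc = MPI_File_write_at_all(fh, offset, buf, count, type,
                                   MPI_STATUS_IGNORE);
    if (rc != MPI_SUCCESS) {
        return rc;
    }

    /* Flush this process's write extents, then barrier so that all writes
     * from the collective call are visible to every process. */
    rc = MPI_File_sync(fh);
    if (rc != MPI_SUCCESS) {
        return rc;
    }
    return MPI_Barrier(comm);
}
```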

