PnetCDF mcoll_perf detects incorrect data #757

Open
@adammoody

Description

The test/nonblocking/mcoll_perf.c test reports a data mismatch when it compares two files that were written in two different ways and should have identical content.

cd test/nonblocking
srun -n2 ./mcoll_perf /unifyfs/testfile.nc
<snip>
P0: diff at line 282 variable[2] var1_2: NC_INT buf1 != buf2 at position 32762

After tracing pwrite and pread calls under a debugger, the problem is that both ranks write to the same byte offsets without any synchronization in between. In this case, rank 1 writes a fill value and rank 0 later writes actual data. It's a race as to which value actually ends up in the file.
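To make the "last writer wins" effect concrete, here is a minimal single-process sketch. It is not taken from the test or from PnetCDF: the helper name, the temporary file path, and the 'F'/'D' byte patterns are made up for illustration. It applies a fill write and a data write to the same 8 bytes in a chosen order and reads back what is left. In the real test the order is decided by timing between two ranks rather than by a flag.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pwrite, pread, mkstemp */
#endif
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write "fill" and "data" to the same 8 bytes in the
 * requested order and return the first byte that ends up in the file.
 * Returns -1 on any I/O error. */
static int last_writer_wins(int fill_first)
{
    char path[] = "/tmp/overlap_XXXXXX";   /* illustrative scratch file */
    char fill[8], data[8], out[8];
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                          /* file goes away on close */

    memset(fill, 'F', sizeof fill);        /* stands in for the fill value */
    memset(data, 'D', sizeof data);        /* stands in for the real data  */

    const char *first  = fill_first ? fill : data;
    const char *second = fill_first ? data : fill;

    if (pwrite(fd, first, 8, 0) != 8 ||
        pwrite(fd, second, 8, 0) != 8 ||
        pread(fd, out, 8, 0) != 8) {
        close(fd);
        return -1;
    }
    close(fd);
    return out[0];
}
```

Calling last_writer_wins(1) models the fill landing first and leaves the data in the file; last_writer_wins(0) models the fill landing last and leaves the fill value, which is the kind of mismatch the test reports.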

The fill call is here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L521

When filling variable 2, rank 1 writes to (offset=648, length=8) and (offset=680, length=8).

And the write call is here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L526

In that write, rank 0 writes to (offset=640, length=16) and (offset=672, length=16), which overlaps with the region that rank 1 wrote to during the fill operation.
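The overlap can be checked mechanically. The following is a small sketch, with a hypothetical helper name, that treats each write as the half-open byte range [offset, offset + length) and tests whether two ranges intersect:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if [off_a, off_a + len_a) and
 * [off_b, off_b + len_b) share at least one byte. */
static bool ranges_overlap(int64_t off_a, int64_t len_a,
                           int64_t off_b, int64_t len_b)
{
    if (len_a <= 0 || len_b <= 0)
        return false;                 /* empty ranges never overlap */
    return off_a < off_b + len_b && off_b < off_a + len_a;
}
```

With the values from this report, ranges_overlap(648, 8, 640, 16) and ranges_overlap(680, 8, 672, 16) are both true: the rank 1 fill at bytes 648..655 sits inside the rank 0 put at bytes 640..655, and likewise for 680..687 inside 672..687.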

The test case can be fixed by adding a call to ncmpi_sync(ncid) between the fill loop and the put loop:

            for (i=2; i<nvars; i++){
                /* fill record variables to silence valgrind complaining about uninitialised bytes */
                for (j=0; j<array_of_gsizes[0]; j++) {
                    err = ncmpi_fill_var_rec(ncid, varid[i], j);
                    CHECK_ERR
                }
            }
            /* <--- add sync here to fix the test case: it separates the
             * fill writes from the overlapping put writes below */
            err = ncmpi_sync(ncid);
            CHECK_ERR
            for (i=0; i<nvars; i++){
                err = ncmpi_put_vara_all(ncid, varid[i], starts[i], counts[i], buf[i], bufcounts[i], MPI_INT);
                CHECK_ERR
            }

For reference, here is the sequence of (offset, length) values for writes from different ranks when k==0. There are multiple overlapping writes, one of which is shown below:

offset, length values for writes
--------  -------
rank 0    rank 1
--------  -------
  0, 336
512, 32   544, 32
576, 32   608, 32
640, 8    648, 8  <--- this "fill" by rank 1
  4, 4
672, 8    680, 8
  4, 4
704, 8    712, 8
  4, 4
736, 8    744, 8
  4, 4
656, 8    664, 8
688, 8    696, 8
720, 8    728, 8
752, 8    760, 8
512, 32   544, 32
576, 32   608, 32
640, 16   704, 16  <-- overlaps with this "put" by rank 0
672, 16   736, 16
656, 16   720, 16
688, 16   752, 16
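
The table can also be scanned for every cross-rank overlap rather than just the one marked. The sketch below is illustrative only: the struct and function names are hypothetical, and the arrays are transcribed by hand from the table above. It counts pairs of a rank 0 write and a rank 1 write whose byte ranges intersect, ignoring the order in which they were issued, so it is an upper bound on the conflicting pairs rather than a proof that each one races.

```c
#include <stddef.h>

struct extent { long offset; long length; };

/* Writes transcribed from the table above (k == 0), in issue order. */
static const struct extent rank0_writes[] = {
    {0, 336}, {512, 32}, {576, 32}, {640, 8}, {4, 4}, {672, 8}, {4, 4},
    {704, 8}, {4, 4}, {736, 8}, {4, 4}, {656, 8}, {688, 8}, {720, 8},
    {752, 8}, {512, 32}, {576, 32}, {640, 16}, {672, 16}, {656, 16},
    {688, 16}
};
static const struct extent rank1_writes[] = {
    {544, 32}, {608, 32}, {648, 8}, {680, 8}, {712, 8}, {744, 8},
    {664, 8}, {696, 8}, {728, 8}, {760, 8}, {544, 32}, {608, 32},
    {704, 16}, {736, 16}, {720, 16}, {752, 16}
};

/* True if the two half-open byte ranges share at least one byte. */
static int extents_intersect(struct extent a, struct extent b)
{
    return a.length > 0 && b.length > 0 &&
           a.offset < b.offset + b.length &&
           b.offset < a.offset + a.length;
}

/* Count (rank 0 write, rank 1 write) pairs that touch common bytes. */
static int count_cross_rank_overlaps(void)
{
    size_t n0 = sizeof rank0_writes / sizeof rank0_writes[0];
    size_t n1 = sizeof rank1_writes / sizeof rank1_writes[0];
    int count = 0;

    for (size_t i = 0; i < n0; i++)
        for (size_t j = 0; j < n1; j++)
            if (extents_intersect(rank0_writes[i], rank1_writes[j]))
                count++;
    return count;
}
```

On the values as transcribed, the pair (640, 16) from rank 0 and (648, 8) from rank 1 is among those counted, which is the overlap marked in the table.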
