Description
Describe the problem you're observing
I was doing some testing using the write-static code and found a subtle bug in
UnifyFS/examples/src/testutil.h
Line 1251 in 58ece44
Describe how to reproduce the problem
When the test file already exists, setting the "remove and reuse test file" flag (-r) would easily trigger the bug.
mpirun -n 2 write-static -n 1 -r
Issue
At the end of the test_remove_file()
, we do a barrier()
to sync processes.
The problem is, there is a potential execution path where one or more processes may exit the function earlier, causing wrong barrier matching.
UnifyFS/examples/src/testutil.h
Lines 1262 to 1268 in 58ece44
This part of code assumes all ranks check the existence of file, get rc = 0, and continue.
Problematic execution sequence:
Rank 0 calls stat(), get rc = 0, then go ahead delete the file. Then Rank 1 came, its stat() call will return -1 since Rank 0 has already deleted the file. Then Rank 1 exits the function without calling the barrier() at the end.
Fix
Add a barrier immediately after stat() to make sure everyone sees the same result.
Overall, we should be careful when using "return" if we will do a barrier later. We need to make sure all processes will always follow the same path: either all return, or all call barrier.
Activity