Description
In a recent occurrence we had thousands of jobs go held with errors like `reading from file /storage/local/data1/condor/execute/dir_1221925/.empty_file: (errno 2) No such file or directory`, with the underlying cause actually being an issue with apptainer on the worker node, so the entire payload including the wrapper script could not be run.
The `.empty_file` is currently created in the wrapper script at `jobsub_lite/templates/simple/simple.sh` (line 193 in 2d2b350).
We should come up with a way to unmask errors like that, if indeed we get something more useful (which I don't know and don't immediately know how to test, without a known-bad worker node). Maybe the job wouldn't even go held, and would just get re-queued?
- Ideally we wouldn't even have to transfer back a dummy file, but that only seems possible currently by setting `transfer_output = False`, which is only applicable to the `Grid` universe for some reason. Could ask Condor team about that. Or look into using `Grid` universe. That's a big solution to a little problem though.
- Maybe if we created `.empty_file` at submission time, and added it to `transfer_input_files`, that would make it always available? (barring some error transferring input, which would be a problem regardless)
- something else?
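
A rough sketch of the "create `.empty_file` at submission time" idea, in case it helps the discussion. It assumes the submit commands are available as a plain dict before the submit file is written, which may not match how jobsub_lite renders its templates. The names `add_empty_file` and `submit_attrs` are hypothetical. It also has not been tried against a bad worker node, so it is unknown whether the job would still go held or would fail with a more useful error.

```python
import os
import tempfile


def add_empty_file(submit_dir, submit_attrs, name=".empty_file"):
    """Create an empty placeholder in submit_dir and list it in transfer_input_files.

    submit_attrs is a hypothetical dict of HTCondor submit commands;
    transfer_input_files is held as a comma-separated string.
    """
    path = os.path.join(submit_dir, name)
    # Append mode creates the file if it is missing and never truncates an existing one.
    with open(path, "a"):
        pass
    existing = submit_attrs.get("transfer_input_files", "")
    files = [f.strip() for f in existing.split(",") if f.strip()]
    if path not in files:
        files.append(path)
    submit_attrs["transfer_input_files"] = ",".join(files)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        attrs = {"transfer_input_files": "simple.sh"}
        add_empty_file(d, attrs)
        print(attrs["transfer_input_files"])
```

With this approach the wrapper script would no longer need to create the file itself, since input transfer should place it in the job's scratch directory before the payload starts.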