(non-)creation of .empty_file in wrapper can mask underlying errors

In a recent incident, thousands of jobs went held with errors like `reading from file /storage/local/data1/condor/execute/dir_1221925/.empty_file: (errno 2) No such file or directory`. The underlying cause was actually a problem with apptainer on the worker node: the entire payload, including the wrapper script, could not be run, so `.empty_file` was never created and the output transfer failed on it.

The `.empty_file` is currently created in the wrapper script at https://github.com/fermitools/jobsub_lite/blob/2d2b350d9c0b389a910ffc770e999a7292690551/templates/simple/simple.sh#L193

We should come up with a way to unmask errors like that, assuming the unmasked error is actually more useful (I don't know that it is, and don't immediately know how to test it without a known-bad worker node). Maybe the job wouldn't even go held, and would just get re-queued?

1. Ideally we wouldn't have to transfer back a dummy file at all, but currently that only seems possible by setting `transfer_output = False`, which for some reason applies only to the `Grid` universe. We could ask the Condor team about that, or look into using the `Grid` universe, but that's a big solution to a little problem.
2. Maybe if we created `.empty_file` at submission time and added it to `transfer_input_files`, it would always be available? (Barring some error transferring input, which would be a problem regardless.)
3. Something else?
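
To make option 2 a bit more concrete, here is a rough sketch of what the submission-side change might look like. This is not the existing jobsub_lite code: the function names, the `submit_dir` argument, and the idea of handling `transfer_input_files` as a plain comma-separated string are all assumptions made for illustration. The only things taken from the discussion above are the `.empty_file` name and the `transfer_input_files` submit command.

```python
import os
import tempfile

# File name taken from the issue; everything else here is hypothetical.
EMPTY_FILE_NAME = ".empty_file"


def ensure_empty_file(submit_dir: str) -> str:
    """Create a zero-length .empty_file in submit_dir and return its path."""
    path = os.path.join(submit_dir, EMPTY_FILE_NAME)
    # Mode "w" creates the file if missing and truncates it if present,
    # so the result is always zero bytes.
    with open(path, "w"):
        pass
    return path


def add_to_transfer_input_files(existing: str, new_path: str) -> str:
    """Append new_path to a comma-separated transfer_input_files value.

    Assumes the value is a simple comma-separated list of paths.
    Blank entries are dropped and duplicates are not added twice.
    """
    entries = [e.strip() for e in existing.split(",") if e.strip()]
    if new_path not in entries:
        entries.append(new_path)
    return ",".join(entries)


if __name__ == "__main__":
    # Stand-in for the real submission directory (assumption).
    submit_dir = tempfile.mkdtemp()
    empty_path = ensure_empty_file(submit_dir)
    value = add_to_transfer_input_files("simple.sh,payload.tar", empty_path)
    print(f"transfer_input_files = {value}")
```

If something like this were adopted, the file should arrive in the job's scratch directory with the rest of the input, so the output transfer would no longer depend on the wrapper having run. Whether that actually surfaces a more useful error on a bad worker node is still the open question above.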

(non-)creation of .empty_file in wrapper can mask underlying errors #560
