(non-)creation of .empty_file in wrapper can mask underlying errors #560

Open
@retzkek

In a recent incident, thousands of jobs went held with errors like "reading from file /storage/local/data1/condor/execute/dir_1221925/.empty_file: (errno 2) No such file or directory". The underlying cause was actually an issue with apptainer on the worker node, which meant the entire payload, including the wrapper script, could not run. Because the wrapper never ran, .empty_file was never created, and the failed output transfer was the only error we saw.

The .empty_file is currently created in the wrapper script with:

touch .empty_file

We should come up with a way to unmask errors like that, assuming we would actually get something more useful. I don't know whether we would, and I don't immediately know how to test it without a known-bad worker node. Maybe the job wouldn't even go held, and would just get re-queued?

  1. Ideally we wouldn't have to transfer back a dummy file at all, but currently that only seems possible by setting transfer_output = False, which for some reason applies only to the Grid universe. We could ask the Condor team about that, or look into using the Grid universe. That's a big solution to a little problem though.
  2. Maybe if we created .empty_file at submission time, and added it to transfer_input_files, that would make it always available? (barring some error transferring input, which would be a problem regardless)
  3. something else?
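
As a rough sketch of option 2, the placeholder could be created on the submit side and its path added to transfer_input_files. The helper name make_empty_file and the temporary directory standing in for the submission directory are hypothetical, and whether this actually changes the hold reason on a bad worker node is untested:

```shell
#!/bin/sh
# Sketch only: make_empty_file is a hypothetical helper, not part of the
# existing wrapper or submission code.
make_empty_file() {
    dir="$1"   # directory where the submit files are staged
    # Create (or truncate to) a zero-length placeholder file.
    : > "$dir/.empty_file" || return 1
    # Print the path so the caller can add it to transfer_input_files.
    printf '%s\n' "$dir/.empty_file"
}

# Stand-in for the real submission directory (assumption for illustration).
workdir=$(mktemp -d)
placeholder=$(make_empty_file "$workdir")

# Fragment that would be merged into the job's existing transfer_input_files
# list in the submit description, before its queue statement.
printf 'transfer_input_files = %s\n' "$placeholder"
```

With the file arriving as job input, the touch in the wrapper becomes redundant, and a wrapper that never starts should no longer fail on the missing placeholder.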

Labels

enhancement (New feature or request), future (Feature requests we will not address now, but at some point in the future.)