Make it possible to save and evaluate checkpoint on CTRL+C / KeyboardInterrupt with Hugging Face Trainer
#35033
Labels: Feature request (Request for a new feature)
Feature request
I would like to request that one or more optional flags be added to the Hugging Face Trainer, perhaps named save_on_exit/save_on_interrupt and eval_on_exit/eval_on_interrupt, to ensure that a checkpoint is always saved upon CTRL+C or perhaps even a kill command sent via wandb.
Motivation
Very often I am training a model and then need the GPU it is running on for something else, so I have to pause training. For example, over the past month I have been training models on my PC, but I have often needed my GPU for other work, so I have exited training and resumed it at night.
Because it is not currently possible to have the Hugging Face Trainer save a checkpoint upon receiving a KeyboardInterrupt, I end up needing to save checkpoints at excessively short intervals to minimise lost progress if I quit training at an arbitrary point. This invariably still results in some progress being lost, and it also causes a lot of write wear on my SSDs, which, like all storage drives, have a limited write lifetime. The wear can add up to quite a lot if you are saving multi-gigabyte models.
By allowing progress to be saved automatically upon exit, I can be assured that, barring an unexpected system crash or repeated CTRL+Cs, my progress will always be saved, and so I do not need to save and evaluate checkpoints so frequently.
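To illustrate the behaviour I have in mind, here is a minimal sketch of how it might be approximated today with a custom callback. It is not a proposed implementation. The class name SaveOnInterruptCallback and its evaluate parameter are hypothetical, and the sketch assumes that the Trainer acts on control.should_evaluate, control.should_save and control.should_training_stop right after on_step_end, which I have not checked across versions. It also assumes training runs in the main thread of a single process, since Python only allows signal handlers to be installed there.

```python
import signal

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch can be read and exercised without transformers.
    TrainerCallback = object


class SaveOnInterruptCallback(TrainerCallback):
    """Hypothetical callback: on the first CTRL+C, let the current step finish,
    then ask the Trainer to evaluate, save a checkpoint, and stop."""

    def __init__(self, evaluate=True):
        self.evaluate = evaluate  # assumes an eval dataset was given to the Trainer
        self.interrupted = False
        self._previous_handler = None

    def _handle_sigint(self, signum, frame):
        if self.interrupted:
            # Second CTRL+C: give up on a clean exit and abort immediately.
            raise KeyboardInterrupt
        self.interrupted = True

    def on_train_begin(self, args, state, control, **kwargs):
        # Replace the default SIGINT handler for the duration of training.
        self._previous_handler = signal.signal(signal.SIGINT, self._handle_sigint)
        return control

    def on_step_end(self, args, state, control, **kwargs):
        if self.interrupted:
            if self.evaluate:
                control.should_evaluate = True
            control.should_save = True
            control.should_training_stop = True
        return control

    def on_train_end(self, args, state, control, **kwargs):
        # Restore whatever SIGINT handler was active before training started.
        if self._previous_handler is not None:
            signal.signal(signal.SIGINT, self._previous_handler)
            self._previous_handler = None
        return control
```

The callback would be passed as Trainer(..., callbacks=[SaveOnInterruptCallback()]), and a later run could pick up from the saved checkpoint with trainer.train(resume_from_checkpoint=True). Setting a flag in the signal handler, rather than catching KeyboardInterrupt around trainer.train(), means the checkpoint is written at a step boundary instead of in the middle of an optimizer update.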
Your contribution
N/A