Make it possible to save and evaluate checkpoint on CTRL+C / KeyboardInterrupt with Hugging Face Trainer #35033

Open
umarbutler opened this issue Dec 1, 2024 · 2 comments
Labels
Feature request Request for a new feature

Comments

@umarbutler
Contributor

Feature request

I would like to request that one or more optional flags be added to the Hugging Face Trainer, perhaps named save_on_exit/save_on_interrupt and eval_on_exit/eval_on_interrupt, to ensure that a checkpoint is always saved upon CTRL+C or perhaps even a kill command sent via wandb.

Motivation

Very often I am training a model but then need the GPU it is running on for something else, so I have to pause training. For example, over the past month, I have been training models on my PC, but I have often needed my GPU during the day, so I have exited training and resumed it at night.

Because, at present, it is not possible to have the Hugging Face Trainer save a checkpoint upon receiving a KeyboardInterrupt, I end up needing to save checkpoints at excessively short intervals to minimise lost progress if I quit training at an arbitrary point. Some progress is invariably still lost, and the frequent saves cause a lot of write wear on my SSDs, which, like all drives, have a limited write lifetime. The wear can in fact add up to quite a lot if you are saving multigigabyte models.

By allowing for progress to automatically save upon exit, I can be assured that, barring an unexpected system crash, or repeated CTRL+Cs being sent, my progress will always be saved and so I do not need to save and evaluate checkpoints so frequently.

Your contribution

N/A

@umarbutler umarbutler added the Feature request Request for a new feature label Dec 1, 2024
@Rocketknight1
Member

cc @muellerzr @SunMarc

@tqpatil

tqpatil commented Dec 6, 2024

Temporary solution:

You could try catching keyboard interrupts (or other exceptions that would end your program) and saving your model manually, with a call to save_pretrained() or any other method that suits your needs, before your program exits.
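As a rough sketch of that workaround, something like the following might do. The helper name train_with_save_on_interrupt and its parameters are made up for illustration and are not part of transformers; it only assumes the object passed in has the train(), save_model(), save_state() and evaluate() methods that Trainer provides. Note that save_model() writes the model weights and config but not the optimizer or scheduler state, so this is probably not a full substitute for a regular checkpoint when resuming.

```python
from typing import Any, Dict


def train_with_save_on_interrupt(
    trainer: Any, output_dir: str, evaluate: bool = False
) -> Dict[str, Any]:
    """Run trainer.train(); on CTRL+C, save (and optionally evaluate) before returning.

    Hypothetical helper, not a transformers API. `trainer` is assumed to be a
    transformers.Trainer or any object with the same method names.
    """
    result: Dict[str, Any] = {"interrupted": False, "metrics": None}
    try:
        trainer.train()
    except KeyboardInterrupt:
        result["interrupted"] = True
        # Write the model weights and config to output_dir.
        trainer.save_model(output_dir)
        # Write trainer_state.json (step count, log history) to the
        # trainer's own configured output directory.
        trainer.save_state()
        if evaluate:
            # Runs the normal evaluation loop on the eval dataset.
            result["metrics"] = trainer.evaluate()
    return result
```

With a real Trainer this would be called as train_with_save_on_interrupt(trainer, "interrupted-model", evaluate=True). A second CTRL+C while saving or evaluating would still abort the process, so it does not protect against repeated interrupts.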
