System Info
- `transformers` version: 4.45.0
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.0.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: yes
- Using GPU in script?: yes
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When using a model with DeepSpeed ZeRO Stage 3, the Trainer saves the contents of the global step checkpoint twice, causing long checkpoint saving times.
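A minimal reproduction sketch (hypothetical names throughout: the model, dataset, and `ds_zero3_config.json` path are illustrative; any causal LM trained under a ZeRO Stage 3 config should show the duplicate save):

```python
# Launch with e.g.: deepspeed repro.py
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # illustrative; any model reproduces the behavior
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")
    out["labels"] = out["input_ids"].copy()
    return out

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:256]")
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    max_steps=10,
    save_steps=5,  # triggers Trainer._save_checkpoint twice during the run
    deepspeed="ds_zero3_config.json",  # hypothetical ZeRO stage-3 config file
)

Trainer(model=model, args=args, train_dataset=dataset).train()
# With stage 3, the DeepSpeed logs show each global_step folder being
# written twice per checkpoint.
```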
It stems from the following code block in `trainer.py`:
```python
def _save_checkpoint(self, model, trial):
    # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
    # want to save except FullyShardedDDP.
    # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

    # Save model checkpoint
    checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

    if self.hp_search_backend is None and trial is None:
        self.store_flos()

    run_dir = self._get_output_dir(trial=trial)
    output_dir = os.path.join(run_dir, checkpoint_folder)
    self.save_model(output_dir, _internal_call=True)

    if not self.args.save_only_model:
        # Save optimizer and scheduler
        self._save_optimizer_and_scheduler(output_dir)
        # Save RNG state
        self._save_rng_state(output_dir)
```
Here `self.save_model(...)` eventually reaches the DeepSpeed checkpointing function `self.model_wrapped.save_checkpoint(output_dir)`. The next line then calls `self._save_optimizer_and_scheduler(...)`, which ends up calling `self.model_wrapped.save_checkpoint(output_dir)` a second time. This results in the global step being saved twice, and since that save includes the optimizer states, it means long checkpoint saving times.
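To confirm the double call, one can wrap the engine's `save_checkpoint` and count invocations per directory. A minimal sketch using a `TrainerCallback` (hypothetical instrumentation, assuming `model_wrapped` becomes the DeepSpeed engine once training has started):

```python
from transformers import TrainerCallback

class CountDeepSpeedSaves(TrainerCallback):
    """Hypothetical instrumentation: wrap the DeepSpeed engine's
    save_checkpoint and count how often each directory is written.
    With this bug present, every checkpoint directory is reported twice."""

    def __init__(self):
        self.trainer = None  # set via attach()
        self.counts = {}
        self.patched = False

    def attach(self, trainer):
        self.trainer = trainer
        trainer.add_callback(self)
        return self

    def on_step_end(self, args, state, control, **kwargs):
        engine = self.trainer.model_wrapped  # the DeepSpeed engine after wrapping
        if not self.patched and hasattr(engine, "save_checkpoint"):
            original = engine.save_checkpoint

            def counted(save_dir, *a, **kw):
                self.counts[save_dir] = self.counts.get(save_dir, 0) + 1
                print(f"save_checkpoint({save_dir!r}) call #{self.counts[save_dir]}")
                return original(save_dir, *a, **kw)

            engine.save_checkpoint = counted
            self.patched = True
```

Attach it with `CountDeepSpeedSaves().attach(trainer)` before calling `trainer.train()`; each checkpoint directory should print call #1 and call #2.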
Expected behavior
Saving a checkpoint for a given global step with ZeRO Stage 3 should write the DeepSpeed checkpoint only once.
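Until this is fixed upstream, one possible workaround is to skip the second DeepSpeed save. A sketch under the assumption traced above, namely that `save_model` has already triggered `model_wrapped.save_checkpoint` for the same `output_dir` (this is not the upstream fix, just an illustration):

```python
from transformers import Trainer

class SingleSaveTrainer(Trainer):
    """Workaround sketch: remember which directory DeepSpeed already
    checkpointed into during save_model and skip the duplicate
    save_checkpoint issued by _save_optimizer_and_scheduler."""

    def save_model(self, output_dir=None, _internal_call=False):
        super().save_model(output_dir, _internal_call=_internal_call)
        if self.is_deepspeed_enabled:
            # Assumes save_model already reached model_wrapped.save_checkpoint,
            # as traced above for ZeRO stage 3.
            self._ds_checkpointed_dir = output_dir

    def _save_optimizer_and_scheduler(self, output_dir):
        if (
            self.is_deepspeed_enabled
            and getattr(self, "_ds_checkpointed_dir", None) == output_dir
        ):
            return  # global step already written once; avoid the second save
        super()._save_optimizer_and_scheduler(output_dir)
```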