Duplicate ZeRO-3 Global Step Checkpoint Saves #34534

Closed
@TobyDrane

Description


System Info

  • transformers version: 4.45.0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using a model with DeepSpeed ZeRO Stage 3, the Trainer saves the contents of the global step checkpoint twice, causing long checkpoint saving times.

It stems from the following code block in trainer.py:

def _save_checkpoint(self, model, trial):
    # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
    # want to save except FullyShardedDDP.
    # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

    # Save model checkpoint
    checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

    if self.hp_search_backend is None and trial is None:
        self.store_flos()

    run_dir = self._get_output_dir(trial=trial)
    output_dir = os.path.join(run_dir, checkpoint_folder)
    self.save_model(output_dir, _internal_call=True)

    if not self.args.save_only_model:
        # Save optimizer and scheduler
        self._save_optimizer_and_scheduler(output_dir)
        # Save RNG state
        self._save_rng_state(output_dir)

It first calls self.save_model(...), which under ZeRO-3 eventually calls the DeepSpeed checkpointing function self.model_wrapped.save_checkpoint(output_dir). A few lines later it calls self._save_optimizer_and_scheduler(...), which also ends up calling self.model_wrapped.save_checkpoint(output_dir).

This means the global step checkpoint is written twice, and because the ZeRO-3 optimiser states are included in each write, checkpoint saving takes a long time.
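One way to confirm the behaviour is to wrap DeepSpeed's save_checkpoint and count how many times it fires per checkpoint directory. The snippet below is only a diagnostic sketch, assuming deepspeed exposes DeepSpeedEngine.save_checkpoint at the package level; it is not part of transformers.

from collections import Counter

import deepspeed

# Hypothetical diagnostic: count calls to DeepSpeed's save_checkpoint per directory.
save_calls = Counter()
_original_save_checkpoint = deepspeed.DeepSpeedEngine.save_checkpoint

def _counting_save_checkpoint(self, save_dir, *args, **kwargs):
    save_calls[save_dir] += 1
    print(f"save_checkpoint call #{save_calls[save_dir]} for {save_dir}")
    return _original_save_checkpoint(self, save_dir, *args, **kwargs)

deepspeed.DeepSpeedEngine.save_checkpoint = _counting_save_checkpoint

# After trainer.train(), any checkpoint directory with a count of 2 was saved twice.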

Expected behavior

Saving a global step checkpoint with ZeRO Stage 3 should write the checkpoint only once.
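As an interim workaround, the second DeepSpeed save could be skipped from the Trainer side. The sketch below is hypothetical: it relies on internal attributes (_save_optimizer_and_scheduler, is_deepspeed_enabled) that may change between versions, and it assumes save_model(...) has already written the full DeepSpeed checkpoint, as described above.

from transformers import Trainer

class SingleSaveTrainer(Trainer):
    # Hypothetical workaround: avoid the redundant ZeRO-3 checkpoint write by
    # skipping the optimizer/scheduler save when DeepSpeed is enabled, since
    # save_model(...) already called model_wrapped.save_checkpoint(output_dir).
    def _save_optimizer_and_scheduler(self, output_dir):
        if self.is_deepspeed_enabled:
            return
        super()._save_optimizer_and_scheduler(output_dir)

Note that this skips everything _save_optimizer_and_scheduler does under DeepSpeed, including any handling of a non-DeepSpeed LR scheduler, so it illustrates the idea rather than being a drop-in fix.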
