Duplicate ZeRO-3 Global Step Checkpoint Saves #34534

Closed
@TobyDrane

Description


System Info

  • transformers version: 4.45.0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using a model with DeepSpeed ZeRO Stage 3, the Trainer saves the contents of the global step checkpoint twice, causing long checkpoint saving times.

It stems from the following code block in trainer.py:

def _save_checkpoint(self, model, trial):
    # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
    # want to save except FullyShardedDDP.
    # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

    # Save model checkpoint
    checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

    if self.hp_search_backend is None and trial is None:
        self.store_flos()

    run_dir = self._get_output_dir(trial=trial)
    output_dir = os.path.join(run_dir, checkpoint_folder)
    self.save_model(output_dir, _internal_call=True)

    if not self.args.save_only_model:
        # Save optimizer and scheduler
        self._save_optimizer_and_scheduler(output_dir)
        # Save RNG state
        self._save_rng_state(output_dir)

It first calls self.save_model(...), which under ZeRO-3 eventually calls the DeepSpeed checkpointing function self.model_wrapped.save_checkpoint(output_dir). A few lines later it calls self._save_optimizer_and_scheduler(...), which also ends up calling self.model_wrapped.save_checkpoint(output_dir).

This means the global step checkpoint is written twice, and because the ZeRO-3 optimiser states are included in each write, checkpoint saving takes a long time.
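One way to confirm the behaviour is to wrap DeepSpeed's save_checkpoint and count how many times it fires per checkpoint directory. The snippet below is only a diagnostic sketch, assuming deepspeed exposes DeepSpeedEngine.save_checkpoint at the package level; it is not part of transformers.

from collections import Counter

import deepspeed

# Hypothetical diagnostic: count calls to DeepSpeed's save_checkpoint per directory.
save_calls = Counter()
_original_save_checkpoint = deepspeed.DeepSpeedEngine.save_checkpoint

def _counting_save_checkpoint(self, save_dir, *args, **kwargs):
    save_calls[save_dir] += 1
    print(f"save_checkpoint call #{save_calls[save_dir]} for {save_dir}")
    return _original_save_checkpoint(self, save_dir, *args, **kwargs)

deepspeed.DeepSpeedEngine.save_checkpoint = _counting_save_checkpoint

# After trainer.train(), any checkpoint directory with a count of 2 was saved twice.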

Expected behavior

Saving a global step checkpoint with ZeRO Stage 3 should write the checkpoint only once.
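As an interim workaround, the second DeepSpeed save could be skipped from the Trainer side. The sketch below is hypothetical: it relies on internal attributes (_save_optimizer_and_scheduler, is_deepspeed_enabled) that may change between versions, and it assumes save_model(...) has already written the full DeepSpeed checkpoint, as described above.

from transformers import Trainer

class SingleSaveTrainer(Trainer):
    # Hypothetical workaround: avoid the redundant ZeRO-3 checkpoint write by
    # skipping the optimizer/scheduler save when DeepSpeed is enabled, since
    # save_model(...) already called model_wrapped.save_checkpoint(output_dir).
    def _save_optimizer_and_scheduler(self, output_dir):
        if self.is_deepspeed_enabled:
            return
        super()._save_optimizer_and_scheduler(output_dir)

Note that this skips everything _save_optimizer_and_scheduler does under DeepSpeed, including any handling of a non-DeepSpeed LR scheduler, so it illustrates the idea rather than being a drop-in fix.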
