
Training config that worked with transformers v4.46.3 results in an OOM error with v4.47.0 (using SFTTrainer) #35108

@jjbuck

Description

System Info

- `transformers` version: 4.47.0
- Platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.35
- Python version: 3.12.6
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.1.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@ArthurZucker @SunMarc @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Training with transformers==4.46.3 runs as expected. Upgrading to transformers==4.47.0 (without changing anything else) leads to an OOM error in the very first training step (see stack trace below).

Run command: accelerate launch --config_file ./accelerate_config.yaml train.py training=path/to/training_config
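
The only dependency that changes between the working and failing runs is transformers (4.46.3 vs 4.47.0). A minimal sketch for logging the versions actually resolved inside the launched environment (importlib.metadata is standard library; trl is assumed to be installed since SFTTrainer comes from it):

# Log the dependency versions resolved inside the launched script.
import importlib.metadata as md

for pkg in ("transformers", "trl", "accelerate", "torch"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")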

Accelerate Config

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
  activation_checkpointing: true
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Training Config

{'accelerator_config': {'dispatch_batches': None,
                        'even_batches': True,
                        'gradient_accumulation_kwargs': None,
                        'non_blocking': False,
                        'split_batches': False,
                        'use_seedable_sampler': True},
 'adafactor': False,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_epsilon': 1e-08,
 'attn_implementation': 'flash_attention_2',
 'auto_find_batch_size': False,
 'average_tokens_across_devices': False,
 'batch_eval_metrics': False,
 'bf16': 'auto',
 'bf16_full_eval': False,
 'chars_per_token': '<CHARS_PER_TOKEN>',
 'data_seed': None,
 'dataloader_drop_last': False,
 'dataloader_num_workers': 0,
 'dataloader_persistent_workers': False,
 'dataloader_pin_memory': True,
 'dataloader_prefetch_factor': None,
 'dataset_batch_size': 1000,
 'dataset_kwargs': {'skip_prepare_dataset': False},
 'ddp_backend': None,
 'ddp_broadcast_buffers': None,
 'ddp_bucket_cap_mb': None,
 'ddp_find_unused_parameters': None,
 'ddp_timeout': 1800,
 'debug': [],
 'deepspeed': None,
 'delete_ckpts': False,
 'disable_tqdm': False,
 'dispatch_batches': None,
 'do_eval': True,
 'do_predict': False,
 'do_train': False,
 'early_stopping_patience': 10,
 'eval_accumulation_steps': None,
 'eval_delay': 0,
 'eval_do_concat_batches': True,
 'eval_exampleset_info_path': '',
 'eval_exampleset_path': '',
 'eval_on_start': True,
 'eval_packing': False,
 'eval_steps': 10,
 'eval_strategy': 'steps',
 'eval_use_gather_object': False,
 'evaluation_strategy': None,
 'exampleset_info_path': '',
 'exampleset_path': '',
 'force_tokenize_data': False,
 'fp16': False,
 'fp16_backend': 'auto',
 'fp16_full_eval': False,
 'fp16_opt_level': 'O1',
 'fsdp': [],
 'fsdp_config': {'min_num_params': 0,
                 'xla': False,
                 'xla_fsdp_grad_ckpt': False,
                 'xla_fsdp_v2': False},
 'fsdp_min_num_params': 0,
 'fsdp_transformer_layer_cls_to_wrap': None,
 'full_determinism': False,
 'gradient_accumulation_steps': 4,
 'gradient_checkpointing': False,
 'gradient_checkpointing_kwargs': {'use_reentrant': False},
 'greater_is_better': False,
 'group_by_length': False,
 'half_precision_backend': 'auto',
 'hub_always_push': False,
 'hub_model_id': None,
 'hub_private_repo': None,
 'hub_strategy': 'every_save',
 'hub_token': '<HUB_TOKEN>',
 'ignore_data_skip': False,
 'include_for_metrics': [],
 'include_inputs_for_metrics': False,
 'include_num_input_tokens_seen': False,
 'include_tokens_per_second': False,
 'jit_mode_eval': False,
 'label_names': ['labels'],
 'label_smoothing_factor': 0.0,
 'learning_rate': 0.0002,
 'length_column_name': 'length',
 'load_best_model_at_end': True,
 'local_rank': 0,
 'log_level': 'passive',
 'log_level_replica': 'warning',
 'log_on_each_node': True,
 'logging_first_step': False,
 'logging_nan_inf_filter': True,
 'logging_steps': 1,
 'logging_strategy': 'steps',
 'lora_alpha': 32,
 'lora_dropout': 0.05,
 'lora_r': 16,
 'lora_target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'],
 'lr_scheduler_kwargs': {},
 'lr_scheduler_type': 'cosine',
 'mask_instructions': True,
 'max_grad_norm': 1.0,
 'max_seq_length': 1024,
 'max_steps': 100,
 'meta_data': {},
 'metric_for_best_model': 'loss',
 'model_name_or_path': 'Qwen/Qwen2.5-7B-Instruct',
 'mp_parameters': '',
 'neftune_noise_alpha': None,
 'no_cuda': False,
 'num_of_sequences': 1024,
 'num_train_epochs': 3,
 'optim': 'adamw_torch',
 'optim_args': None,
 'optim_target_modules': None,
 'overwrite_output_dir': False,
 'packing': False,
 'past_index': -1,
 'per_device_eval_batch_size': 1,
 'per_device_train_batch_size': 1,
 'per_gpu_eval_batch_size': None,
 'per_gpu_train_batch_size': None,
 'prediction_loss_only': False,
 'push_to_hub': False,
 'push_to_hub_model_id': None,
 'push_to_hub_organization': None,
 'push_to_hub_token': '<PUSH_TO_HUB_TOKEN>',
 'ray_scope': 'last',
 'remove_unused_columns': True,
 'restore_callback_states_from_checkpoint': False,
 'resume_from_checkpoint': None,
 'save_on_each_node': False,
 'save_only_model': False,
 'save_safetensors': True,
 'save_steps': 20,
 'save_strategy': 'steps',
 'save_total_limit': None,
 'seed': 42,
 'skip_memory_metrics': True,
 'smoke_test': False,
 'split_batches': None,
 'tf32': None,
 'torch_compile': False,
 'torch_compile_backend': None,
 'torch_compile_mode': None,
 'torch_dtype': 'bfloat16',
 'torch_empty_cache_steps': None,
 'torchdynamo': None,
 'tpu_metrics_debug': False,
 'tpu_num_cores': None,
 'use_cpu': False,
 'use_ipex': False,
 'use_legacy_prediction_loop': False,
 'use_liger_kernel': False,
 'use_mps_device': False,
 'use_peft': False,
 'val_set_size': 0.0,
 'warmup_ratio': 0.1,
 'warmup_steps': 0,
 'weight_decay': 0.0}

Training script


from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer


def main(cfg):
    # sft_config, train_dataset, and eval_dataset are derived from cfg
    # (config parsing and dataset construction are omitted from this report).
    accelerator = Accelerator()
    model_kwargs = dict(
        attn_implementation=sft_config.attn_implementation,
        torch_dtype=sft_config.torch_dtype,
        use_cache=False,
    )
    model = AutoModelForCausalLM.from_pretrained(sft_config.model_name_or_path, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(sft_config.model_name_or_path, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=sft_config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=None,
        dataset_kwargs=sft_config.dataset_kwargs,
    )

    trainer.train()
    trainer.save_model()


if __name__ == "__main__":
    main()
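
To put a number on the difference between the two versions, a peak-memory probe around trainer.train() can help. A minimal sketch using torch.cuda's peak counters (run once on 4.46.3 and once on 4.47.0 with otherwise identical settings; under FSDP each rank reports its own device):

# Minimal peak-memory probe; compare the printed peaks across versions.
import torch

def report_peak_memory(train_fn):
    torch.cuda.reset_peak_memory_stats()
    try:
        train_fn()
    finally:
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"peak CUDA memory allocated: {peak_gib:.2f} GiB")

# e.g. inside main(): report_peak_memory(trainer.train)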

Stack trace

Traceback (most recent call last):
  File "/home/ubuntu/***/train.py", line 233, in main
    trainer.train()
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2164, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3653, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3709, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 823, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 811, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1184, in forward
    loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 36, in ForCausalLMLoss
    logits = logits.float()
             ^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.97 GiB. GPU 5 has a total capacity of 39.38 GiB of which 1.53 GiB is free. Including non-PyTorch memory, this process has 37.84 GiB memory in use. Of the allocated memory 35.69 GiB is allocated by PyTorch, and 521.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
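
The failing allocation happens at logits.float(), i.e. when the loss path upcasts the bf16 logits to fp32 before the cross-entropy. A rough sketch of the size of that transient copy, assuming a vocab size of about 152k for Qwen/Qwen2.5-7B-Instruct and the batch/sequence settings from the config above (the exact 1.97 GiB request in the trace will depend on padding and how many tokens are actually in flight):

# Back-of-the-envelope size of the fp32 copy created by logits.float().
# vocab_size is an assumption for Qwen/Qwen2.5-7B-Instruct; batch and
# sequence length are taken from the training config above.
vocab_size = 152_064   # assumed
seq_len = 1024         # max_seq_length
batch = 1              # per_device_train_batch_size

elements = batch * seq_len * vocab_size
print(f"bf16 logits     ≈ {elements * 2 / 1024**3:.2f} GiB")
print(f"extra fp32 copy ≈ {elements * 4 / 1024**3:.2f} GiB")

On a 40 GiB card that is already nearly full under FULL_SHARD, even a transient copy of this size can be enough to tip a training step over the limit.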

Expected behavior

Training should complete without errors.
