System Info
- `transformers` version: 4.47.0
- Platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.35
- Python version: 3.12.6
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.1.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: NVIDIA A100-SXM4-40GB
Who can help?
@ArthurZucker @SunMarc @muellerzr
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Training with transformers==4.46.3 runs as expected. Upgrading to transformers==4.47.0 (without changing anything else) leads to an OOM error in the very first training step (see stack trace below).
Run command: `accelerate launch --config_file ./accelerate_config.yaml train.py training=path/to/training_config`
Accelerate Config
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
activation_checkpointing: true
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
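For reference, the FSDP options above can also be set programmatically. This is a minimal sketch only, assuming accelerate's FullyShardedDataParallelPlugin (field names are based on accelerate ~1.1 and may differ slightly between versions; it is not taken from the original script):

# Sketch: mirrors the YAML above via accelerate's FSDP plugin.
# Field names are assumptions for accelerate ~1.1 and may vary by version.
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",
    auto_wrap_policy="transformer_based_wrap",
    backward_prefetch="BACKWARD_PRE",
    state_dict_type="FULL_STATE_DICT",
    cpu_offload=False,
    forward_prefetch=False,
    use_orig_params=False,
    sync_module_states=True,
    cpu_ram_efficient_loading=True,
    activation_checkpointing=True,
)
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)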
Training Config
{'accelerator_config': {'dispatch_batches': None,
'even_batches': True,
'gradient_accumulation_kwargs': None,
'non_blocking': False,
'split_batches': False,
'use_seedable_sampler': True},
'adafactor': False,
'adam_beta1': 0.9,
'adam_beta2': 0.999,
'adam_epsilon': 1e-08,
'attn_implementation': 'flash_attention_2',
'auto_find_batch_size': False,
'average_tokens_across_devices': False,
'batch_eval_metrics': False,
'bf16': 'auto',
'bf16_full_eval': False,
'chars_per_token': '<CHARS_PER_TOKEN>',
'data_seed': None,
'dataloader_drop_last': False,
'dataloader_num_workers': 0,
'dataloader_persistent_workers': False,
'dataloader_pin_memory': True,
'dataloader_prefetch_factor': None,
'dataset_batch_size': 1000,
'dataset_kwargs': {'skip_prepare_dataset': False},
'ddp_backend': None,
'ddp_broadcast_buffers': None,
'ddp_bucket_cap_mb': None,
'ddp_find_unused_parameters': None,
'ddp_timeout': 1800,
'debug': [],
'deepspeed': None,
'delete_ckpts': False,
'disable_tqdm': False,
'dispatch_batches': None,
'do_eval': True,
'do_predict': False,
'do_train': False,
'early_stopping_patience': 10,
'eval_accumulation_steps': None,
'eval_delay': 0,
'eval_do_concat_batches': True,
'eval_exampleset_info_path': '',
'eval_exampleset_path': '',
'eval_on_start': True,
'eval_packing': False,
'eval_steps': 10,
'eval_strategy': 'steps',
'eval_use_gather_object': False,
'evaluation_strategy': None,
'exampleset_info_path': '',
'exampleset_path': '',
'force_tokenize_data': False,
'fp16': False,
'fp16_backend': 'auto',
'fp16_full_eval': False,
'fp16_opt_level': 'O1',
'fsdp': [],
'fsdp_config': {'min_num_params': 0,
'xla': False,
'xla_fsdp_grad_ckpt': False,
'xla_fsdp_v2': False},
'fsdp_min_num_params': 0,
'fsdp_transformer_layer_cls_to_wrap': None,
'full_determinism': False,
'gradient_accumulation_steps': 4,
'gradient_checkpointing': False,
'gradient_checkpointing_kwargs': {'use_reentrant': False},
'greater_is_better': False,
'group_by_length': False,
'half_precision_backend': 'auto',
'hub_always_push': False,
'hub_model_id': None,
'hub_private_repo': None,
'hub_strategy': 'every_save',
'hub_token': '<HUB_TOKEN>',
'ignore_data_skip': False,
'include_for_metrics': [],
'include_inputs_for_metrics': False,
'include_num_input_tokens_seen': False,
'include_tokens_per_second': False,
'jit_mode_eval': False,
'label_names': ['labels'],
'label_smoothing_factor': 0.0,
'learning_rate': 0.0002,
'length_column_name': 'length',
'load_best_model_at_end': True,
'local_rank': 0,
'log_level': 'passive',
'log_level_replica': 'warning',
'log_on_each_node': True,
'logging_first_step': False,
'logging_nan_inf_filter': True,
'logging_steps': 1,
'logging_strategy': 'steps',
'lora_alpha': 32,
'lora_dropout': 0.05,
'lora_r': 16,
'lora_target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'],
'lr_scheduler_kwargs': {},
'lr_scheduler_type': 'cosine',
'mask_instructions': True,
'max_grad_norm': 1.0,
'max_seq_length': 1024,
'max_steps': 100,
'meta_data': {},
'metric_for_best_model': 'loss',
'model_name_or_path': 'Qwen/Qwen2.5-7B-Instruct',
'mp_parameters': '',
'neftune_noise_alpha': None,
'no_cuda': False,
'num_of_sequences': 1024,
'num_train_epochs': 3,
'optim': 'adamw_torch',
'optim_args': None,
'optim_target_modules': None,
'overwrite_output_dir': False,
'packing': False,
'past_index': -1,
'per_device_eval_batch_size': 1,
'per_device_train_batch_size': 1,
'per_gpu_eval_batch_size': None,
'per_gpu_train_batch_size': None,
'prediction_loss_only': False,
'push_to_hub': False,
'push_to_hub_model_id': None,
'push_to_hub_organization': None,
'push_to_hub_token': '<PUSH_TO_HUB_TOKEN>',
'ray_scope': 'last',
'remove_unused_columns': True,
'restore_callback_states_from_checkpoint': False,
'resume_from_checkpoint': None,
'save_on_each_node': False,
'save_only_model': False,
'save_safetensors': True,
'save_steps': 20,
'save_strategy': 'steps',
'save_total_limit': None,
'seed': 42,
'skip_memory_metrics': True,
'smoke_test': False,
'split_batches': None,
'tf32': None,
'torch_compile': False,
'torch_compile_backend': None,
'torch_compile_mode': None,
'torch_dtype': 'bfloat16',
'torch_empty_cache_steps': None,
'torchdynamo': None,
'tpu_metrics_debug': False,
'tpu_num_cores': None,
'use_cpu': False,
'use_ipex': False,
'use_legacy_prediction_loop': False,
'use_liger_kernel': False,
'use_mps_device': False,
'use_peft': False,
'val_set_size': 0.0,
'warmup_ratio': 0.1,
'warmup_steps': 0,
'weight_decay': 0.0}
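Most of the keys above are TrainingArguments defaults or project-specific fields (e.g. exampleset_path, mask_instructions). As a hedged, condensed sketch of the memory-relevant subset, assuming trl's SFTConfig (the custom keys are omitted and the values below are copied from the dump, not from the original config class):

# Hypothetical condensed config; only the settings most relevant to memory use.
from trl import SFTConfig

sft_config = SFTConfig(
    output_dir="./out",                 # not shown in the dump above
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    max_seq_length=1024,
    packing=False,
    bf16=True,                          # the dump uses 'auto'
    gradient_checkpointing=False,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_steps=100,
    eval_strategy="steps",
    eval_steps=10,
    save_steps=20,
    logging_steps=1,
)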
Training script
# Imports inferred from the snippet (omitted in the original report)
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer


def main(cfg):
    # sft_config, train_dataset and eval_dataset are built from `cfg`
    # elsewhere in the original script (omitted from the report)
    accelerator = Accelerator()

    model_kwargs = dict(
        attn_implementation=sft_config.attn_implementation,
        torch_dtype=sft_config.torch_dtype,
        use_cache=False,
    )
    model = AutoModelForCausalLM.from_pretrained(sft_config.model_name_or_path, **model_kwargs)

    tokenizer = AutoTokenizer.from_pretrained(sft_config.model_name_or_path, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=sft_config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=None,
        dataset_kwargs=sft_config.dataset_kwargs,
    )
    trainer.train()
    trainer.save_model()


if __name__ == "__main__":
    main()  # the original script builds `cfg` and passes it in here (omitted from the report)
Stack trace
Traceback (most recent call last):
File "/home/ubuntu/***/train.py", line 233, in main
trainer.train()
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2164, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3653, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3709, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 823, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 811, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1184, in forward
loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/***/.venv/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 36, in ForCausalLMLoss
logits = logits.float()
^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.97 GiB. GPU 5 has a total capacity of 39.38 GiB of which 1.53 GiB is free. Including non-PyTorch memory, this process has 37.84 GiB memory in use. Of the allocated memory 35.69 GiB is allocated by PyTorch, and 521.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
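The allocation that fails is the float32 copy of the logits created by `logits.float()` in ForCausalLMLoss. As a rough, illustrative estimate of that extra allocation (the shapes below are assumptions for this setup, not values read from the trace):

# Back-of-the-envelope sketch: the upcast materializes a second logits tensor
# in fp32 next to the existing bf16 one. Shapes here are illustrative only.
def fp32_logits_bytes(batch_size: int, seq_len: int, vocab_size: int) -> int:
    return batch_size * seq_len * vocab_size * 4  # 4 bytes per float32 element

# Qwen2.5 models use a vocabulary of roughly 152k tokens
print(fp32_logits_bytes(1, 1024, 152_064) / 2**30, "GiB")  # ≈ 0.58 GiB per device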
Expected behavior
Training should run to completion without the OOM on transformers==4.47.0, just as it does on transformers==4.46.3.