
Falcon model training on multiple GPUs  #34492

Closed
@BigDataMLexplorer

Description

System Info

  • transformers version: 4.44.0
  • Platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.4
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): 2.17.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: Tesla V100-FHHL-16GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hello,
I have been training several models from a single Jupyter notebook. When I load a model, I pass device_map="auto" so it is split across multiple (4) GPUs. After that I use the Trainer, and it runs the parallelized training automatically. This works for every model except Falcon (7B and 11B); for the other models, parallel training kicks in automatically and all four GPUs are used.
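For reference, this is how I check the sharding; as far as I understand, accelerate records the per-module placement in hf_device_map when device_map="auto" is used (the mapping shown in the comment is illustrative):

# Inspect how accelerate placed the submodules across the GPUs
print(model.hf_device_map)
# e.g. {'transformer.word_embeddings': 0, ..., 'score': 3}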

What should I do, please? Part of my code and the error message are below:

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

num_labels = 11
# device_map="auto" shards the model across all 4 GPUs
model = AutoModelForSequenceClassification.from_pretrained(
    "Huggingface_models/Falcon2 11b",
    num_labels=num_labels,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir=".....",
    save_strategy="steps",
    eval_strategy="steps",
    eval_steps=half_steps_per_epoch // 2,   # half_steps_per_epoch is computed earlier in the notebook
    save_steps=half_steps_per_epoch // 2,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    #gradient_accumulation_steps=2,
    #gradient_checkpointing=True,
    per_device_train_batch_size=2,
    #per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    #dataloader_num_workers=4,
    #logging_steps=500,
    load_best_model_at_end=True,
    fp16=True,
    #warmup_ratio=0.1,
    save_total_limit=4,
    #report_to="tensorboard"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=Train_tokenized,
    eval_dataset=Eval_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)

train_result = trainer.train()

Error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
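Based on my reading of the traceback, nll_loss receives labels on cuda:0 while the pooled logits come out of the classification head on cuda:3. A workaround I am considering (untested, and the subclass name is my own) is to compute the loss in a Trainer subclass and move the labels onto the logits' device first:

import torch.nn as nn
from transformers import Trainer

class DeviceSafeTrainer(Trainer):  # hypothetical name, not part of transformers
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # lives on the GPU that holds the score head
        # Move labels to the logits' device before the loss kernel runs
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(
            logits.view(-1, logits.size(-1)),
            labels.to(logits.device).view(-1),
        )
        return (loss, outputs) if return_outputs else loss

With this, the setup above would stay the same except for constructing DeviceSafeTrainer instead of Trainer. I am unsure whether this is the right fix or just masks a bug in the Falcon modeling code.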

Expected behavior

Training with Falcon should parallelize across all 4 GPUs, just as it does for the other models, without the device-mismatch error.
