System Info
- transformers version: 4.44.0
- Platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.4
- Huggingface_hub version: 0.23.2
- Safetensors version: 0.4.3
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): 2.17.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no distributed setup; the model is split across 4 GPUs with device_map="auto"
- Using GPU in script?: yes (4x Tesla V100)
- GPU type: Tesla V100-FHHL-16GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hello,
I am training several models in a single Jupyter notebook. When I load a model, I use device_map="auto" to split it across multiple (4) GPUs. After that I use the Trainer, and it handles the parallel training automatically.
This always works except for Falcon (7B and 11B). For all other models, parallel training takes place automatically and all 4 GPUs are used.
What should I do, please? I am posting part of my code and the error message:
from transformers import AutoModelForSequenceClassification

num_labels = 11
# load the model sharded across all visible GPUs
model = AutoModelForSequenceClassification.from_pretrained(
    "Huggingface_models/Falcon2 11b", num_labels=num_labels, device_map="auto"
)
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS",
)
from peft import prepare_model_for_kbit_training, get_peft_model
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
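For reference, before training I sanity-check how Accelerate sharded the model (with device_map="auto", from_pretrained records the placement in hf_device_map) and how many parameters LoRA left trainable. The half_steps_per_epoch value used below comes from earlier in my notebook; the definition here is only a simplified reconstruction (assuming per_device_train_batch_size=2 and a single model replica) so the snippet is self-contained:

# hf_device_map records which device each module group was placed on,
# e.g. {"transformer.word_embeddings": 0, ..., "score": 3}
print(model.hf_device_map)
model.print_trainable_parameters()  # PEFT: trainable vs. total parameter counts

# Simplified reconstruction of half_steps_per_epoch (the real value is
# computed earlier in the notebook): one epoch is len(train set) / batch size
# optimizer steps, since model parallelism keeps a single model replica.
steps_per_epoch = len(Train_tokenized) // 2  # per_device_train_batch_size = 2
half_steps_per_epoch = steps_per_epoch // 2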
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=".....",
    save_strategy="steps",
    eval_strategy="steps",
    eval_steps=half_steps_per_epoch // 2,
    save_steps=half_steps_per_epoch // 2,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    # gradient_accumulation_steps=2,
    # gradient_checkpointing=True,
    per_device_train_batch_size=2,
    # per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    # dataloader_num_workers=4,
    # logging_steps=500,
    load_best_model_at_end=True,
    fp16=True,
    # warmup_ratio=0.1,
    save_total_limit=4,
    # report_to="tensorboard",
)
from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=Train_tokenized,
    eval_dataset=Eval_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
train_result = trainer.train()
ERROR:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
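A workaround I am experimenting with (I am not sure it is the correct fix; the DeviceSafeTrainer below is my own sketch, not an official API, and it assumes single-label classification) is to override Trainer.compute_loss so the labels are moved to whatever device the logits end up on before the loss is computed:

import torch
from transformers import Trainer

class DeviceSafeTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # With device_map="auto" the classification head (and thus the
        # logits) can live on a different GPU (here cuda:3) than the input
        # batch (cuda:0), so align the label device before the loss call.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(
            logits.view(-1, logits.size(-1)),
            labels.to(logits.device).view(-1),
        )
        return (loss, outputs) if return_outputs else loss

Using this subclass in place of Trainer above should avoid the nll_loss device mismatch, but I would prefer a proper fix.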
Expected behavior
Training should run with the model split across all 4 GPUs, as it does for the other models.