System Info
- transformers version: 4.44.0
- Platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.4
- Huggingface_hub version: 0.23.2
- Safetensors version: 0.4.3
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): 2.17.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no distributed setup; the model is split across 4 GPUs with device_map="auto"
- Using GPU in script?: yes (4x Tesla V100)
- GPU type: Tesla V100-FHHL-16GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hello,
I am training several models in a single Jupyter notebook. When I load a model, I use device_map="auto" to split it across multiple (4) GPUs. After that I use the Trainer, and it handles the parallel training automatically.
This always works except for Falcon (7B and 11B). For all other models, parallel training takes place automatically and all 4 GPUs are used.
What should I do, please? I am posting part of my code and the error message:
from transformers import AutoModelForSequenceClassification

num_labels = 11
# load the model sharded across all visible GPUs
model = AutoModelForSequenceClassification.from_pretrained(
    "Huggingface_models/Falcon2 11b", num_labels=num_labels, device_map="auto"
)
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS",
)
from peft import prepare_model_for_kbit_training, get_peft_model
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
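For reference, before training I sanity-check how Accelerate sharded the model (with device_map="auto", from_pretrained records the placement in hf_device_map) and how many parameters LoRA left trainable. The half_steps_per_epoch value used below comes from earlier in my notebook; the definition here is only a simplified reconstruction (assuming per_device_train_batch_size=2 and a single model replica) so the snippet is self-contained:

# hf_device_map records which device each module group was placed on,
# e.g. {"transformer.word_embeddings": 0, ..., "score": 3}
print(model.hf_device_map)
model.print_trainable_parameters()  # PEFT: trainable vs. total parameter counts

# Simplified reconstruction of half_steps_per_epoch (the real value is
# computed earlier in the notebook): one epoch is len(train set) / batch size
# optimizer steps, since model parallelism keeps a single model replica.
steps_per_epoch = len(Train_tokenized) // 2  # per_device_train_batch_size = 2
half_steps_per_epoch = steps_per_epoch // 2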
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=".....",
    save_strategy="steps",
    eval_strategy="steps",
    eval_steps=half_steps_per_epoch // 2,
    save_steps=half_steps_per_epoch // 2,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    # gradient_accumulation_steps=2,
    # gradient_checkpointing=True,
    per_device_train_batch_size=2,
    # per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    # dataloader_num_workers=4,
    # logging_steps=500,
    load_best_model_at_end=True,
    fp16=True,
    # warmup_ratio=0.1,
    save_total_limit=4,
    # report_to="tensorboard",
)
from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=Train_tokenized,
    eval_dataset=Eval_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
train_result = trainer.train()
ERROR:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
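A workaround I am experimenting with (I am not sure it is the correct fix; the DeviceSafeTrainer below is my own sketch, not an official API, and it assumes single-label classification) is to override Trainer.compute_loss so the labels are moved to whatever device the logits end up on before the loss is computed:

import torch
from transformers import Trainer

class DeviceSafeTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # With device_map="auto" the classification head (and thus the
        # logits) can live on a different GPU (here cuda:3) than the input
        # batch (cuda:0), so align the label device before the loss call.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(
            logits.view(-1, logits.size(-1)),
            labels.to(logits.device).view(-1),
        )
        return (loss, outputs) if return_outputs else loss

Using this subclass in place of Trainer above should avoid the nll_loss device mismatch, but I would prefer a proper fix.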
Expected behavior
Training should run with the model split across all 4 GPUs, as it does for the other models.