System Info
- transformers version: 4.44.2
- Platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.29.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: yes (accelerate FSDP)
- Using GPU in script?: yes
- GPU type: NVIDIA RTX A6000
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
This is a duplicate of #24098 and #25695, but I figured it'd still be useful to resubmit this issue since (1) I have a code example, and (2) I include the different error message I get with mixed precision, which may increase visibility for other people who run into this problem and search existing GitHub issues.
When I do multi-GPU training (launched with accelerate launch --num_processes=2) using Trainer with a small dataset and gradient_accumulation_steps > 2, I often get the following error:
Traceback (most recent call last):
File "/workspace/program.py", line 34, in <module>
trainer.train()
File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
self.optimizer.step()
File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 150, in step
self.optimizer.step(closure)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
adamw(
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
func(
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (3219712) must match the size of tensor b (128) at non-singleton dimension 1

If FP16 mixed precision is enabled, the error looks like this instead:
Traceback (most recent call last):
File "/workspace/program.py", line 34, in <module>
trainer.train()
File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
self.optimizer.step()
File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 137, in step
self.scaler.step(self.optimizer, closure)
File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 457, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 352, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 192, in patched_step
return method(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
adamw(
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
func(
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 516, in _multi_tensor_adamw
grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 409, in _group_tensors_by_device_and_dtype
return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 38, in _group_tensors_by_device_and_dtype
torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

Here's a minimal example. Run the following with accelerate launch --config_file=accelerate_config.yaml --num_processes=2 program.py:
# program.py
from datasets import Dataset
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments,
)
dataset = Dataset.from_dict(
{"text": ["positive", "negative"], "label": [1, 0]}
) # tiny dataset of 2 examples
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
"EleutherAI/pythia-14m", num_labels=2
)
model.config.pad_token_id = tokenizer.eos_token_id
training_args = TrainingArguments(
output_dir="/tmp/results",
num_train_epochs=10,
per_device_train_batch_size=2,
gradient_accumulation_steps=16,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
)
trainer.train()

# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no" # change this to "fp16" to get the other error
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My use case here was a codebase where we had added some end-to-end tests. We used a very small dataset so the tests would stay reasonably fast, but then we ran into these exceptions and were confused.
Expected behavior
I would expect this to just work without crashing.
But maybe it's not really a sensible setup to have such a small training set. In #24098, commenters suggested that the training set size has to be greater than gradient_accumulation_steps * num_GPUs * per_device_train_batch_size.
In that case it would be nice to get a clear error message saying that this is the problem.
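To make that suggestion concrete, here's a rough sketch of the kind of pre-flight check I have in mind. The helper name and the exact threshold are my own guesses based on the rule of thumb from #24098, not an existing Trainer API:

# check_train_size.py (hypothetical helper, not part of transformers)
def check_min_train_size(
    num_examples, per_device_train_batch_size, gradient_accumulation_steps, num_processes
):
    # Examples consumed by one full optimizer step across all ranks.
    examples_per_optimizer_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_processes
    )
    if num_examples <= examples_per_optimizer_step:
        raise ValueError(
            f"Training set has {num_examples} examples, but one optimizer step needs "
            f"{examples_per_optimizer_step} ({per_device_train_batch_size} per-device batch size "
            f"x {gradient_accumulation_steps} accumulation steps x {num_processes} processes). "
            f"Increase the dataset size or reduce these values."
        )

# For the reproduction above: 2 examples vs. 2 * 16 * 2 = 64, so this raises.
check_min_train_size(
    num_examples=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    num_processes=2,
)

Something along these lines inside Trainer, or even just a warning, would have saved us a lot of debugging time.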