Exception raised with trainer + accelerate launch FSDP + large gradient accumulation steps + small dataset #33413

@tomtseng

System Info

  • transformers version: 4.44.2
  • Platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes (accelerate FSDP)
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX A6000

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This is a duplicate of #24098 and #25695, but I figured it'd still be useful to resubmit this issue since (1) I have a code example, and (2) I include a different error message that I get with mixed precision, which may make the problem more visible to other people who run into it and search existing GitHub issues.

When I run multi-GPU training (launched with accelerate launch --num_processes=2) using Trainer with a small dataset and gradient_accumulation_steps > 2, training often repeatedly fails with the following error:

Traceback (most recent call last):
  File "/workspace/program.py", line 34, in <module>
    trainer.train()
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 150, in step
    self.optimizer.step(closure)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
    torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (3219712) must match the size of tensor b (128) at non-singleton dimension 1

If FP16 mixed precision is enabled, the error looks like this instead:

Traceback (most recent call last):
  File "/workspace/program.py", line 34, in <module>
    trainer.train()
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 137, in step
    self.scaler.step(self.optimizer, closure)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 457, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 352, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 192, in patched_step
    return method(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 516, in _multi_tensor_adamw
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 409, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 38, in _group_tensors_by_device_and_dtype
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

Here's a minimal example. Run the following with accelerate launch --config_file=accelerate_config.yaml --num_processes=2 program.py:

# program.py
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = Dataset.from_dict(
    {"text": ["positive", "negative"], "label": [1, 0]}
)  # tiny dataset of 2 examples

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-14m", num_labels=2
)
model.config.pad_token_id = tokenizer.eos_token_id

training_args = TrainingArguments(
    output_dir="/tmp/results",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no"  # change this to "fp16" to get the other error
num_machines: 1
num_processes: 1  # overridden by --num_processes=2 on the command line
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My use case: we had added some end-to-end tests to a codebase and used a very small dataset so the tests would stay reasonably fast, but then we ran into these exceptions and were confused.

Expected behavior

I'd expect this to just work without crashing.
But maybe it's not really a sensible setup to have such a small training set. In #24098 commenters suggested that the training set size

has to be greater than gradient_accumulation_steps * num_GPUs * per_device_train_batch_size.

If that's the intended constraint, it would be nice to get an explicit error message saying that this is the problem.
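
As a stopgap, the kind of check I have in mind could live in the user's own training script before calling trainer.train(). A minimal sketch, assuming the quoted rule of thumb is the real constraint (check_dataset_size is a hypothetical helper, not a Trainer or accelerate API):

# hypothetical pre-flight check for a training script; not provided by Trainer/accelerate
def check_dataset_size(
    num_examples: int,
    per_device_train_batch_size: int,
    num_processes: int,
    gradient_accumulation_steps: int,
) -> None:
    # examples consumed by a single optimizer step across all ranks
    needed = per_device_train_batch_size * num_processes * gradient_accumulation_steps
    if num_examples <= needed:
        raise ValueError(
            f"Training set has only {num_examples} examples, but one optimizer step "
            f"consumes up to {needed} "
            "(per_device_train_batch_size * num_processes * gradient_accumulation_steps). "
            "Reduce gradient_accumulation_steps / batch size or add more data."
        )

# with the repro above this raises immediately: 2 examples vs. 2 * 2 * 16 = 64
check_dataset_size(
    num_examples=len(tokenized_dataset),
    per_device_train_batch_size=2,
    num_processes=2,
    gradient_accumulation_steps=16,
)

Something along these lines inside Trainer itself would turn the confusing tensor-size error into an actionable message.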
