Conversation

@stas00 (Collaborator) commented Oct 20, 2025

This PR fixes the following error:

```
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 985, in grad_handling_hook
[rank0]:     self.process_gradients(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1524, in process_gradients
[rank0]:     self.reduce_ready_partitions_and_remove_grads(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank0]:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1006, in reduce_independent_p_g_buckets_and_remove_grads
[rank0]:     self.report_ipg_memory_usage("In ipg_remove_grads before reduce_ipg_grads", param.numel(), param.dtype)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/base_optimizer.py", line 70, in report_ipg_memory_usage
[rank0]:     bucket = self.ipg_buckets[dt]
[rank0]:              ~~~~~~~~~~~~~~~~^^^^
[rank0]: KeyError: torch.bfloat16
```

The problem doesn't occur when `seq_parallel_communication_data_type: bf16` is used, but it fails with `fp32` (or when the setting is omitted).
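
For context, a minimal sketch of where this setting sits in a DeepSpeed config, written here as a Python dict; the surrounding fields are illustrative only, and only `seq_parallel_communication_data_type` is relevant to this issue:

```python
# Illustrative ds_config sketch (not taken from this PR); only the
# seq_parallel_communication_data_type key matters for this issue.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    # Workaround observed above: "bf16" avoids the KeyError, while "fp32"
    # (or omitting the key) triggers it.
    "seq_parallel_communication_data_type": "bf16",
}
```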

In this PR I'm syncing with the ZeRO-3 (z3) implementation, which doesn't pass the `dtype` arg and instead lets the traversal over the existing dtypes do the work:

```python
def report_ipg_memory_usage(self, tag, param_elems, dtype=None):
    dtypes = self.ipg_buckets.keys() if dtype is None else [dtype]
    for dt in dtypes:
        bucket = self.ipg_buckets[dt]
        elem_count = bucket.elements + param_elems
        percent_of_bucket_size = (100.0 * elem_count) // self.reduce_bucket_size
        see_memory_usage(
            f"{tag}: elems in_bucket {dt} {bucket.elements} param {param_elems} max_percent {percent_of_bucket_size}"
        )
```

Fixes: #7607

@stas00 stas00 enabled auto-merge (squash) October 20, 2025 20:39
@stas00 stas00 merged commit 9c86cd9 into master Oct 22, 2025
13 checks passed
@stas00 stas00 deleted the stas00-patch-1 branch October 22, 2025 17:15

Development

Successfully merging this pull request may close these issues.

[BUG] ZeRO 2 ipg_buckets key error

3 participants