Conversation

@stas00 (Collaborator) commented Oct 20, 2025

This PR fixes the following error:

```
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 985, in grad_handling_hook
[rank0]:     self.process_gradients(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1524, in process_gradients
[rank0]:     self.reduce_ready_partitions_and_remove_grads(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank0]:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1006, in reduce_independent_p_g_buckets_and_remove_grads
[rank0]:     self.report_ipg_memory_usage("In ipg_remove_grads before reduce_ipg_grads", param.numel(), param.dtype)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/base_optimizer.py", line 70, in report_ipg_memory_usage
[rank0]:     bucket = self.ipg_buckets[dt]
[rank0]:              ~~~~~~~~~~~~~~~~^^^^
[rank0]: KeyError: torch.bfloat16
```

The problem doesn't occur when `seq_parallel_communication_data_type: bf16` is used, but it fails with `fp32` (or when the setting is omitted).
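
For context, a minimal sketch of where this setting sits in a DeepSpeed config, written here as a Python dict; the surrounding fields are illustrative only, and only `seq_parallel_communication_data_type` is relevant to this issue:

```python
# Illustrative ds_config sketch (not taken from this PR); only the
# seq_parallel_communication_data_type key matters for this issue.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    # Workaround observed above: "bf16" avoids the KeyError, while "fp32"
    # (or omitting the key) triggers it.
    "seq_parallel_communication_data_type": "bf16",
}
```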

In this PR I'm syncing with the ZeRO-3 (z3) implementation, which doesn't pass the `dtype` arg and instead lets the traversal over the existing dtypes do the work:

```python
def report_ipg_memory_usage(self, tag, param_elems, dtype=None):
    dtypes = self.ipg_buckets.keys() if dtype is None else [dtype]
    for dt in dtypes:
        bucket = self.ipg_buckets[dt]
        elem_count = bucket.elements + param_elems
        percent_of_bucket_size = (100.0 * elem_count) // self.reduce_bucket_size
        see_memory_usage(
            f"{tag}: elems in_bucket {dt} {bucket.elements} param {param_elems} max_percent {percent_of_bucket_size}"
        )
```

Fixes: #7607

@stas00 stas00 enabled auto-merge (squash) October 20, 2025 20:39
@stas00 stas00 merged commit 9c86cd9 into master Oct 22, 2025
13 checks passed
@stas00 stas00 deleted the stas00-patch-1 branch October 22, 2025 17:15

Development

Successfully merging this pull request may close these issues.

[BUG] ZeRO 2 ipg_buckets key error

3 participants