
Conversation

@leejianwoo-collab

fix: #7747
Hi DeepSpeed team,

I'm submitting this fix to resolve issue #7747: under the DeepSpeed bf16 configuration, MoE router parameters are forcibly cast to bf16, causing a dtype mismatch in the fp32 routing logic.

Key changes:

Added a should_preserve_dtype() helper to check a parameter's preservation flag (see the sketch after this list)
Extended parameter processing in _setup_for_real_optimizer() to handle mixed-precision scenarios
Updated storage management in _update_storage_to_flattened_tensor() to preserve the original data types
Included comprehensive tests and documentation
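To make the intent concrete, here is a minimal sketch of the helper and of how the flag could be honored when parameters are cast; the helper name comes from this PR, while the surrounding cast_params_for_bf16() function is a simplified, hypothetical stand-in for the optimizer's actual processing path:

```python
import torch

def should_preserve_dtype(param: torch.Tensor) -> bool:
    """Return True if the parameter is flagged to keep its original dtype."""
    return getattr(param, "preserve_dtype", False)

def cast_params_for_bf16(params):
    """Illustrative only: cast parameters to bf16 unless flagged for preservation."""
    out = []
    for p in params:
        if should_preserve_dtype(p):
            out.append(p)  # keep the original dtype (e.g. fp32 router weights)
        else:
            out.append(p.to(torch.bfloat16))  # normal bf16 mixed-precision path
    return out
```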
Usage:
Users can now mark specific parameters with param.preserve_dtype = True to keep their original precision while other parameters still benefit from bf16 mixed-precision training. For example:
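A short usage sketch (the gate module below is illustrative; preserve_dtype is the flag introduced by this PR):

```python
import torch

# Hypothetical MoE gate: a small linear layer producing routing logits.
router = torch.nn.Linear(1024, 8)

# Flag the router parameters so the bf16 optimizer keeps them in fp32.
for p in router.parameters():
    p.preserve_dtype = True
```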

This solution is backward compatible and provides an official mechanism for handling numerically sensitive modules like MoE routers. I've tested this thoroughly and believe it will be valuable for users facing similar precision-related issues.

Looking forward to your feedback and review. Thank you for your time and consideration!

Best regards,

@tohtana
Contributor

tohtana commented Jan 1, 2026

Hi @leejianwoo-collab,
This PR removes a lot of code in the BF16 optimizer, and I’m concerned it may break existing DeepSpeed features. Could you keep the current behavior?
I agree that adding parameter-level precision control is very important. If you think we should deprecate some features as part of this PR, we’re happy to discuss.

