
[BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch #6770

@traincheck-team

Description


Describe the bug

Related to #3718.

A KeyError is thrown inside deepspeed.initialize, at runtime/zero/stage_1_and_2.py, line 574, in _create_param_mapping, due to inconsistent usage of the model's parameters and the parameters managed by the optimizer.

Full Traceback:
```log
Traceback (most recent call last):
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 53, in 
    expose_bug()
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 50, in expose_bug
    model_engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 553, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 574, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: tensor([-0.1179,  0.0190, -0.2361, -0.0330, -0.1097,  0.0685, -0.2788, -0.0424,
         0.1805, -0.1327,  0.2295, -0.1572,  0.2908, -0.2656, -0.2262,  0.2296,
        -0.0720, -0.1561, -0.0781, -0.2091,  0.1213,  0.2913,  0.0560, -0.1072,
         0.0990, -0.2922, -0.1741, -0.0852,  0.1493,  0.0510, -0.3086,  0.0510,
         0.0163,  0.1622, -0.1566,  0.2947, -0.1843,  0.2590,  0.1528, -0.0735,
        -0.1443, -0.1165, -0.3152,  0.1250, -0.0102,  0.1953,  0.2947,  0.2566,
         0.2942,  0.1985, -0.2688, -0.1689,  0.2786, -0.0039,  0.0989,  0.2617,
         0.0778, -0.1381,  0.0307, -0.1530, -0.0360,  0.1461, -0.0746,  0.1142,
        -0.2235, -0.2544, -0.1772, -0.1136, -0.0287,  0.3057,  0.1761,  0.0782,
         0.1699,  0.0997, -0.1385, -0.0923, -0.1219, -0.2313, -0.0925,  0.1703,
         0.0032,  0.3071, -0.0467,  0.0065, -0.2251,  0.2949, -0.2507,  0.2847,
        -0.0638, -0.1945,  0.1630,  0.2463, -0.0249, -0.0586, -0.1923, -0.0291,
         0.0769,  0.2839,  0.0655, -0.3152,  0.2118, -0.0918,  0.0764,  0.2585,
        -0.0240,  0.1981, -0.1708, -0.2991,  0.1741,  0.0652,  0.2600, -0.2397,
         0.0431,  0.0839,  0.1979,  0.0051, -0.2389,  0.0657,  0.1400,  0.3115,
         0.2091, -0.2200,  0.2610, -0.0994, -0.2996, -0.2710, -0.0466,  0.0237,
        -0.0053,  0.2881,  0.0077,  0.1194,  0.0026,  0.1493,  0.2361,  0.2915,
         0.1206,  0.2198, -0.2668, -0.2032, -0.2406,  0.2112,  0.1519,  0.1636,
        -0.1826, -0.2490,  0.2637,  0.2380, -0.2703, -0.2249,  0.0025,  0.1674,
        -0.2751,  0.0442, -0.1255,  0.0972, -0.2766,  0.0444,  0.0058,  0.0765,
         0.2798,  0.0579,  0.2499, -0.2095,  0.0173,  0.0000], device='cuda:0',
       requires_grad=True)
[2024-11-20 13:00:07,916] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3765479
```

Suspected Root Cause

deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

This issue can be triggered whenever, in the arguments to deepspeed.initialize, the parameters in optimizer.param_groups are not a subset of model.parameters().

At the fault location, the code looks up parameter names stored in self.param_names using the tensors in self.bit16_groups.

(screenshot: location of the bug)

self.bit16_groups is populated from optimizer.param_groups:

(screenshot: bit16 params come from the optimizer)

self.param_names is populated from the model itself:

(screenshot: param_names source is the model)

Thus, if the optimizer's parameters are not a subset of the model's parameters, a KeyError is thrown.
The case where the optimizer's parameters are not a subset of the model's is quite common, due to optimization techniques such as parameter grouping and ZeRO optimization.
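The precondition can be illustrated without DeepSpeed at all. The snippet below is only a minimal sketch of the mismatch; the extra parameter is a made-up example and not something DeepSpeed itself creates:

```python
import torch
import torch.nn as nn

# Minimal illustration (not DeepSpeed code) of the precondition that triggers the bug:
# the optimizer owns a parameter tensor that the model does not.
model = nn.Linear(10, 10)
extra = nn.Parameter(torch.zeros(5))  # never registered on the model
optimizer = torch.optim.Adam(list(model.parameters()) + [extra], lr=1e-3)

model_param_ids = {id(p) for p in model.parameters()}
optim_param_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
print(optim_param_ids <= model_param_ids)  # False -> _create_param_mapping would raise a KeyError
```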

To Reproduce

We prepared a simple script to reproduce this error. In it, deepspeed.initialize is accidentally called twice. After the first deepspeed.initialize, optimizer.param_groups has been consolidated into a single flattened parameter, which causes the KeyError in the second deepspeed.initialize.

  1. Install DeepSpeed 0.15.4.

  2. Run bug.py using deepspeed --num_gpus=1 bug.py.

```python
# bug.py
import torch
import deepspeed
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 5)
    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)

# Main function to expose the bug
def expose_bug():
    # Initialize model
    model = SimpleModel()
    # Initialize DeepSpeed configurations for fp16
    ds_config_fp16 = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True,},
        "zero_optimization": {"stage": 2}
    }

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
    # optimizer has 4 params now
    print(optimizer.param_groups)
    # Initialize DeepSpeed engine
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
    # optimizer has 1 param now (flattened by ZeRO)
    print(optimizer.param_groups)
    # EXCEPTION!!!
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

if __name__ == "__main__":
    expose_bug()
```
  3. Notice that the second deepspeed.initialize throws the KeyError exception.

  4. Also notice that the first print of optimizer.param_groups shows 4 params, while the second print shows only one param (its content is the 4 original params flattened into a single buffer).

Prior to deepspeed.initialize:
(screenshot: params not merged)

After deepspeed.initialize:
(screenshot: merged params)

Since the merged param does not exist in the model, the second deepspeed.initialize throws a KeyError.
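This can be made visible with a small check (hypothetical, not part of the repro script above) inserted between the two deepspeed.initialize calls:

```python
# Hypothetical check between the two deepspeed.initialize calls in bug.py:
flat_params = [p for g in optimizer.param_groups for p in g["params"]]
print(len(flat_params))  # 1: the original 4 tensors were merged into one flat buffer
# The flat buffer is not one of the model's parameters, so the second
# deepspeed.initialize cannot find a name for it in self.param_names.
print(any(p is flat_params[0] for p in model.parameters()))  # False
```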

Expected behavior / Suggested Fix

We suggest two improvements to DeepSpeed:

  1. Forbid deepspeed.initialize on models / optimizers that have already been passed to a previous deepspeed.initialize call.
  2. Explicitly check that the parameters in optimizer.param_groups are a subset of model.parameters() and raise a more user-friendly exception or warning (a sketch follows below).
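A minimal sketch of what the check in suggestion 2 could look like; the function name and its placement inside deepspeed.initialize are hypothetical, not DeepSpeed's actual code:

```python
def _validate_optimizer_params(model, optimizer):
    """Hypothetical guard: fail fast with a descriptive message instead of a bare KeyError."""
    model_param_ids = {id(p) for p in model.parameters()}
    for group in optimizer.param_groups:
        for p in group["params"]:
            if id(p) not in model_param_ids:
                raise ValueError(
                    "optimizer.param_groups contains a parameter that is not in "
                    "model.parameters(). This typically happens when the optimizer was "
                    "already consumed by a previous deepspeed.initialize call, which "
                    "flattens its parameter groups under ZeRO stage 1/2."
                )
```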

ds_report output

```log
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/xxx/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 31.24 GB
```

I would be more than happy to contribute the two suggested fixes; let me know what you think!
