Describe the bug
Related to #3718.
A KeyError is thrown inside `deepspeed.initialize`, at `runtime/zero/stage_1_and_2.py`, line 574, in `_create_param_mapping`, due to inconsistent usage of model parameters and parameters managed by the optimizer.
Full Traceback
```log
Traceback (most recent call last):
File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 53, in
expose_bug()
File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 50, in expose_bug
model_engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 553, in __init__
self._param_slice_mappings = self._create_param_mapping()
File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 574, in _create_param_mapping
lp_name = self.param_names[lp]
KeyError: tensor([-0.1179, 0.0190, -0.2361, -0.0330, -0.1097, 0.0685, -0.2788, -0.0424,
0.1805, -0.1327, 0.2295, -0.1572, 0.2908, -0.2656, -0.2262, 0.2296,
-0.0720, -0.1561, -0.0781, -0.2091, 0.1213, 0.2913, 0.0560, -0.1072,
0.0990, -0.2922, -0.1741, -0.0852, 0.1493, 0.0510, -0.3086, 0.0510,
0.0163, 0.1622, -0.1566, 0.2947, -0.1843, 0.2590, 0.1528, -0.0735,
-0.1443, -0.1165, -0.3152, 0.1250, -0.0102, 0.1953, 0.2947, 0.2566,
0.2942, 0.1985, -0.2688, -0.1689, 0.2786, -0.0039, 0.0989, 0.2617,
0.0778, -0.1381, 0.0307, -0.1530, -0.0360, 0.1461, -0.0746, 0.1142,
-0.2235, -0.2544, -0.1772, -0.1136, -0.0287, 0.3057, 0.1761, 0.0782,
0.1699, 0.0997, -0.1385, -0.0923, -0.1219, -0.2313, -0.0925, 0.1703,
0.0032, 0.3071, -0.0467, 0.0065, -0.2251, 0.2949, -0.2507, 0.2847,
-0.0638, -0.1945, 0.1630, 0.2463, -0.0249, -0.0586, -0.1923, -0.0291,
0.0769, 0.2839, 0.0655, -0.3152, 0.2118, -0.0918, 0.0764, 0.2585,
-0.0240, 0.1981, -0.1708, -0.2991, 0.1741, 0.0652, 0.2600, -0.2397,
0.0431, 0.0839, 0.1979, 0.0051, -0.2389, 0.0657, 0.1400, 0.3115,
0.2091, -0.2200, 0.2610, -0.0994, -0.2996, -0.2710, -0.0466, 0.0237,
-0.0053, 0.2881, 0.0077, 0.1194, 0.0026, 0.1493, 0.2361, 0.2915,
0.1206, 0.2198, -0.2668, -0.2032, -0.2406, 0.2112, 0.1519, 0.1636,
-0.1826, -0.2490, 0.2637, 0.2380, -0.2703, -0.2249, 0.0025, 0.1674,
-0.2751, 0.0442, -0.1255, 0.0972, -0.2766, 0.0444, 0.0058, 0.0765,
0.2798, 0.0579, 0.2499, -0.2095, 0.0173, 0.0000], device='cuda:0',
requires_grad=True)
[2024-11-20 13:00:07,916] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3765479
```
Suspected Root Cause
```python
deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
```
This issue can be triggered whenever, in the arguments to `deepspeed.initialize`, the parameters in `optimizer.param_groups` are not a subset of `model.parameters()`.
At the fault location, the code tries to look up parameter names stored in `self.param_names` using the tensors in `self.bit16_groups`.
`self.bit16_groups` is populated from `optimizer.param_groups`, while `self.param_names` is populated from the model itself.
Thus, if the optimizer's parameters are not a subset of the model's parameters, a KeyError is thrown.
The case where the optimizer's parameters are not a subset of the model's parameters is quite common, due to optimization techniques like parameter grouping and ZeRO optimization.
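For illustration, here is a minimal sketch of the failing lookup. This is not DeepSpeed's actual code; it only shows why probing a name map keyed by the model's parameter tensors with tensors taken from the optimizer's param groups raises a KeyError.

```python
# Minimal sketch of the failing lookup -- NOT DeepSpeed's actual code, just an
# illustration of why _create_param_mapping raises KeyError when the optimizer
# holds tensors that are not the model's own parameters.
import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# Analogous to self.param_names: a dict keyed by the model's parameter tensors.
param_names = {param: name for name, param in model.named_parameters()}

# An optimizer whose param_groups contain a tensor that is NOT a model
# parameter, e.g. a flattened copy like the one left behind by a previous
# optimizer-wrapping step.
fused = nn.Parameter(torch.cat([p.detach().flatten() for p in model.parameters()]))
optimizer = torch.optim.Adam([fused], lr=1e-3)

# Analogous to iterating self.bit16_groups inside _create_param_mapping.
for group in optimizer.param_groups:
    for lp in group["params"]:
        try:
            print(param_names[lp])
        except KeyError:
            print("KeyError: optimizer parameter is not one of the model's parameters")
```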
To Reproduce
We prepared a rather simple script to reproduce this error. In this script, `deepspeed.initialize` is accidentally called twice. During the first `deepspeed.initialize`, `optimizer.param_groups` is consolidated into a single flattened parameter, which then causes the KeyError in the second `deepspeed.initialize`.
- Install deepspeed 0.15.4
- Run `bug.py` using `deepspeed --num_gpus=1 bug.py`
```python
# bug.py
import torch
import deepspeed
import torch.nn as nn


# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)


# Main function to expose the bug
def expose_bug():
    # Initialize model
    model = SimpleModel()

    # Initialize DeepSpeed configuration for fp16 with ZeRO stage 2
    ds_config_fp16 = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

    # The optimizer has 4 params now
    print(optimizer.param_groups)

    # Initialize DeepSpeed engine (first call)
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

    # The optimizer has only 1 (flattened) param now
    print(optimizer.param_groups)

    # EXCEPTION!!! The second call raises KeyError in _create_param_mapping
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)


if __name__ == "__main__":
    expose_bug()
```
- Notice that the second `deepspeed.initialize` throws the `KeyError` exception.
- Also notice that the first print of `optimizer.param_groups` shows 4 params, while the second print shows only one param (that single param is the merge of the original 4). Since, in the second `deepspeed.initialize`, the merged param does not exist in the model, a KeyError is thrown.
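Until a fix lands in DeepSpeed, a user-side guard before each `deepspeed.initialize` call can surface the problem with a clearer message. The sketch below is only a workaround suggestion, not part of DeepSpeed's API, and the function name is made up; it uses an identity-based membership test, which mirrors the dictionary lookup that fails inside `_create_param_mapping`.

```python
# Hypothetical user-side guard; the function name is invented for illustration.
def assert_optimizer_params_in_model(model, optimizer):
    model_param_ids = {id(p) for p in model.parameters()}
    for group in optimizer.param_groups:
        for p in group["params"]:
            if id(p) not in model_param_ids:
                raise ValueError(
                    "optimizer.param_groups contains a tensor that is not in "
                    "model.parameters(); was this optimizer already passed "
                    "through deepspeed.initialize()?"
                )

# Usage: call before (each) deepspeed.initialize
# assert_optimizer_params_in_model(model, optimizer)
```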
Expected behavior / Suggested Fix
We expect two behaviors here from DeepSpeed:

- Forbid `deepspeed.initialize` on models / optimizers that have already been used in another `deepspeed.initialize` (a rough user-level sketch of this idea follows below).
- Check explicitly that `optimizer.param_groups` is a subset of `model.parameters()` and throw a more user-friendly exception or warning.
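As a rough illustration of the first suggestion, a wrapper around `deepspeed.initialize` could tag optimizers it has already consumed and refuse to wrap them again. The wrapper and the attribute name are hypothetical; this is not how DeepSpeed currently behaves, only a sketch of the intended check.

```python
import deepspeed

# Hypothetical wrapper illustrating suggestion 1; the attribute name
# "_consumed_by_deepspeed" is invented for this sketch.
def initialize_once(model, optimizer, **kwargs):
    if getattr(optimizer, "_consumed_by_deepspeed", False):
        raise RuntimeError(
            "This optimizer was already passed to deepspeed.initialize(); "
            "re-initializing it is not supported."
        )
    engine, opt, dataloader, scheduler = deepspeed.initialize(
        model=model, optimizer=optimizer, **kwargs
    )
    optimizer._consumed_by_deepspeed = True
    return engine, opt, dataloader, scheduler
```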
ds_report output
```
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/xxx/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 31.24 GB
```
I would be more than happy to contribute the two suggested fixes; let me know what you think!



