
[BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch #6770

@traincheck-team

Description


Describe the bug

Related to #3718.

A KeyError is thrown inside deepspeed.initialize, at runtime/zero/stage_1_and_2.py, line 574, in _create_param_mapping, due to inconsistent usage of the model's parameters and the parameters managed by the optimizer.

Full Traceback:
```log
Traceback (most recent call last):
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 53, in 
    expose_bug()
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 50, in expose_bug
    model_engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 553, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 574, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: tensor([-0.1179,  0.0190, -0.2361, -0.0330, -0.1097,  0.0685, -0.2788, -0.0424,
         0.1805, -0.1327,  0.2295, -0.1572,  0.2908, -0.2656, -0.2262,  0.2296,
        -0.0720, -0.1561, -0.0781, -0.2091,  0.1213,  0.2913,  0.0560, -0.1072,
         0.0990, -0.2922, -0.1741, -0.0852,  0.1493,  0.0510, -0.3086,  0.0510,
         0.0163,  0.1622, -0.1566,  0.2947, -0.1843,  0.2590,  0.1528, -0.0735,
        -0.1443, -0.1165, -0.3152,  0.1250, -0.0102,  0.1953,  0.2947,  0.2566,
         0.2942,  0.1985, -0.2688, -0.1689,  0.2786, -0.0039,  0.0989,  0.2617,
         0.0778, -0.1381,  0.0307, -0.1530, -0.0360,  0.1461, -0.0746,  0.1142,
        -0.2235, -0.2544, -0.1772, -0.1136, -0.0287,  0.3057,  0.1761,  0.0782,
         0.1699,  0.0997, -0.1385, -0.0923, -0.1219, -0.2313, -0.0925,  0.1703,
         0.0032,  0.3071, -0.0467,  0.0065, -0.2251,  0.2949, -0.2507,  0.2847,
        -0.0638, -0.1945,  0.1630,  0.2463, -0.0249, -0.0586, -0.1923, -0.0291,
         0.0769,  0.2839,  0.0655, -0.3152,  0.2118, -0.0918,  0.0764,  0.2585,
        -0.0240,  0.1981, -0.1708, -0.2991,  0.1741,  0.0652,  0.2600, -0.2397,
         0.0431,  0.0839,  0.1979,  0.0051, -0.2389,  0.0657,  0.1400,  0.3115,
         0.2091, -0.2200,  0.2610, -0.0994, -0.2996, -0.2710, -0.0466,  0.0237,
        -0.0053,  0.2881,  0.0077,  0.1194,  0.0026,  0.1493,  0.2361,  0.2915,
         0.1206,  0.2198, -0.2668, -0.2032, -0.2406,  0.2112,  0.1519,  0.1636,
        -0.1826, -0.2490,  0.2637,  0.2380, -0.2703, -0.2249,  0.0025,  0.1674,
        -0.2751,  0.0442, -0.1255,  0.0972, -0.2766,  0.0444,  0.0058,  0.0765,
         0.2798,  0.0579,  0.2499, -0.2095,  0.0173,  0.0000], device='cuda:0',
       requires_grad=True)
[2024-11-20 13:00:07,916] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3765479
```

Suspected Root Cause

deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

This issue can be triggered whenever, in the arguments to deepspeed.initialize, the parameters in optimizer.param_groups are not a subset of model.parameters().

At the fault location, the code looks up parameter names stored in self.param_names using the tensors in self.bit16_groups.

(screenshot: location of the bug)

self.bit16_groups is populated from optimizer.param_groups:

(screenshot: bit16 params come from the optimizer)

self.param_names is populated from the model itself:

(screenshot: param_names source is the model)

Thus, if the optimizer's parameters are not a subset of the model's parameters, a KeyError is thrown.
The case where the optimizer's parameters are not a subset of the model's is quite common, due to optimization techniques such as parameter grouping and ZeRO optimization.
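The precondition can be illustrated without DeepSpeed at all. The snippet below is only a minimal sketch of the mismatch; the extra parameter is a made-up example and not something DeepSpeed itself creates:

```python
import torch
import torch.nn as nn

# Minimal illustration (not DeepSpeed code) of the precondition that triggers the bug:
# the optimizer owns a parameter tensor that the model does not.
model = nn.Linear(10, 10)
extra = nn.Parameter(torch.zeros(5))  # never registered on the model
optimizer = torch.optim.Adam(list(model.parameters()) + [extra], lr=1e-3)

model_param_ids = {id(p) for p in model.parameters()}
optim_param_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
print(optim_param_ids <= model_param_ids)  # False -> _create_param_mapping would raise a KeyError
```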

To Reproduce

We prepared a simple script to reproduce this error. In it, deepspeed.initialize is accidentally called twice. After the first deepspeed.initialize, optimizer.param_groups has been consolidated into a single flattened parameter, which causes the KeyError in the second deepspeed.initialize.

  1. Install DeepSpeed 0.15.4.

  2. Run bug.py using deepspeed --num_gpus=1 bug.py.

```python
# bug.py
import torch
import deepspeed
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 5)
    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)

# Main function to expose the bug
def expose_bug():
    # Initialize model
    model = SimpleModel()
    # Initialize DeepSpeed configurations for fp16
    ds_config_fp16 = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True,},
        "zero_optimization": {"stage": 2}
    }

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
    # optimizer has 4 params now
    print(optimizer.param_groups)
    # Initialize DeepSpeed engine
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
    # optimizer has 1 param now (flattened by ZeRO)
    print(optimizer.param_groups)
    # EXCEPTION!!!
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

if __name__ == "__main__":
    expose_bug()
```
  3. Notice that the second deepspeed.initialize throws the KeyError exception.

  4. Also notice that the first print of optimizer.param_groups shows 4 params, while the second print shows only one param (its content is the 4 original params flattened into a single buffer).

Prior to deepspeed.initialize:
(screenshot: params not merged)

After deepspeed.initialize:
(screenshot: merged params)

Since the merged param does not exist in the model, the second deepspeed.initialize throws a KeyError.
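This can be made visible with a small check (hypothetical, not part of the repro script above) inserted between the two deepspeed.initialize calls:

```python
# Hypothetical check between the two deepspeed.initialize calls in bug.py:
flat_params = [p for g in optimizer.param_groups for p in g["params"]]
print(len(flat_params))  # 1: the original 4 tensors were merged into one flat buffer
# The flat buffer is not one of the model's parameters, so the second
# deepspeed.initialize cannot find a name for it in self.param_names.
print(any(p is flat_params[0] for p in model.parameters()))  # False
```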

Expected behavior / Suggested Fix

We suggest two improvements to DeepSpeed:

  1. Forbid deepspeed.initialize on models / optimizers that have already been passed to a previous deepspeed.initialize call.
  2. Explicitly check that the parameters in optimizer.param_groups are a subset of model.parameters() and raise a more user-friendly exception or warning (a sketch follows below).
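A minimal sketch of what the check in suggestion 2 could look like; the function name and its placement inside deepspeed.initialize are hypothetical, not DeepSpeed's actual code:

```python
def _validate_optimizer_params(model, optimizer):
    """Hypothetical guard: fail fast with a descriptive message instead of a bare KeyError."""
    model_param_ids = {id(p) for p in model.parameters()}
    for group in optimizer.param_groups:
        for p in group["params"]:
            if id(p) not in model_param_ids:
                raise ValueError(
                    "optimizer.param_groups contains a parameter that is not in "
                    "model.parameters(). This typically happens when the optimizer was "
                    "already consumed by a previous deepspeed.initialize call, which "
                    "flattens its parameter groups under ZeRO stage 1/2."
                )
```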

ds_report output

```log
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/xxx/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 31.24 GB
```

I would be more than happy to contribute the two suggested fixes; let me know what you think!
