Opened on Nov 26, 2024
Describe the bug
We are using DeepSpeed to pre-train a model, but the validation step fails with the following error:
File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/model.py", line 246, in validation_step
embeddings = self.forward(batch)
File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/model.py", line 219, in forward
embeddings[channel_name] = self.normalize(encoder(batch[channel_name]))
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/encoders/uvnet.py", line 465, in forward
hidden_crv_feat = self.curv_encoder(input_crv_feat)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/encoders/uvnet.py", line 153, in forward
x = self.conv1(x)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same
To Reproduce
The code is proprietary, so we cannot share it in full.
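As a stand-in for the proprietary code, the following is a minimal sketch that triggers the same class of error in plain PyTorch, without DeepSpeed or Ray. It is an assumption about the failure mode, not our actual model: the layer, the tensor shapes, and the `run_conv` helper are hypothetical. It feeds a float32 batch into a `Conv1d` whose weights are bfloat16, then shows that casting the batch to the weight dtype avoids the error.

```python
import torch
import torch.nn as nn


def run_conv(cast_input: bool) -> torch.Tensor:
    """Run a bfloat16 Conv1d on a float32 batch, optionally casting the batch first."""
    # Hypothetical stand-in for the first conv layer of an encoder; shapes are made up.
    conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3).to(torch.bfloat16)
    batch = torch.randn(2, 4, 16)  # float32, as a dataloader typically yields

    if cast_input:
        # Match the input dtype to the layer's weight dtype before the call.
        batch = batch.to(conv.weight.dtype)
    return conv(batch)


# Without the cast, the dtype check inside the convolution raises a RuntimeError.
try:
    run_conv(cast_input=False)
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)

print("uncast input:", mismatch_error)
print("cast input dtype:", run_conv(cast_input=True).dtype)
```

If this matches what happens in our run, one possible workaround would be to cast each batch to the encoder's parameter dtype before calling it in `forward`. Whether that is the right fix depends on why the batch is not being cast automatically.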
DeepSpeed configuration:
deepspeed_configs = {
    "zero_allow_untested_optimizer": True,
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "allgather_bucket_size": 500000000,
        "reduce_bucket_size": 500000000
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True
    }
}
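One detail in this configuration may be worth checking, although it may be unrelated to the dtype error: the DeepSpeed configuration documentation describes `offload_param` as available only with ZeRO stage 3, while this config sets stage 2. The sketch below is a hypothetical sanity check, not part of DeepSpeed. The `find_stage_mismatches` helper and its minimum-stage table are assumptions written for illustration.

```python
def find_stage_mismatches(config: dict) -> list[str]:
    """Return warnings for ZeRO options set below the stage they are documented for."""
    zero = config.get("zero_optimization", {})
    stage = zero.get("stage", 0)

    # Assumed minimum stages, taken from the DeepSpeed configuration docs.
    min_stage = {"offload_optimizer": 1, "offload_param": 3}

    warnings = []
    for key, required in min_stage.items():
        if key in zero and stage < required:
            warnings.append(
                f"{key} is documented for ZeRO stage >= {required}, but stage is {stage}"
            )
    return warnings


# Trimmed copy of the zero_optimization section from the config in this report.
example_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}

for warning in find_stage_mismatches(example_config):
    print(warning)
```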
Expected behavior
When we train with DDP, we do not get this error. As soon as we switch to deepspeed==0.15.2, the run fails with the input type and weight type mismatch shown above.
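The error message reports bfloat16 weights and a float32 input, so one way to compare the DDP and DeepSpeed runs is to log the parameter dtypes of the model right before the failing call. The sketch below is a hypothetical diagnostic: `count_param_dtypes` and the toy model are made up for illustration, and the explicit `.to(torch.bfloat16)` only simulates a bf16 cast rather than reproducing what DeepSpeed does internally.

```python
from collections import Counter

import torch
import torch.nn as nn


def count_param_dtypes(module: nn.Module) -> Counter:
    """Count how many parameters of a module use each dtype."""
    return Counter(str(param.dtype) for param in module.parameters())


# Toy model standing in for an encoder; the real architecture is not shown here.
model = nn.Sequential(nn.Conv1d(4, 8, 3), nn.BatchNorm1d(8))

before = count_param_dtypes(model)  # default float32 parameters
after = count_param_dtypes(model.to(torch.bfloat16))  # simulated bf16 cast

print("before cast:", dict(before))
print("after cast:", dict(after))
```

Calling a helper like this inside `validation_step`, together with printing the dtype of each tensor in the batch, should show whether the mismatch comes from the weights being cast while the inputs are left in float32.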
ds_report output
System info:
- OS: Ubuntu 22.04
- GPU count and types: x16 H100s
- Python version: 3.10.1
Launcher context
Ray Lightning, Ray VM Launcher