[BUG] RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same

**Describe the bug**

We are using `deepspeed` to pre-train a model, but the validation step fails with the following error:
```
 File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/model.py", line 246, in validation_step
    embeddings = self.forward(batch)
  File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/model.py", line 219, in forward
    embeddings[channel_name] = self.normalize(encoder(batch[channel_name]))
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/encoders/uvnet.py", line 465, in forward
    hidden_crv_feat = self.curv_encoder(input_crv_feat)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/encoders/uvnet.py", line 153, in forward
    x = self.conv1(x)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same
```


**To Reproduce**
The code is proprietary, so we cannot share it in full.
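
In place of the real model, here is a minimal sketch that triggers the same class of error without DeepSpeed. It assumes the failure comes down to a float32 input reaching a `Conv1d` whose weights are bfloat16. The layer sizes and the `try_forward` helper are made up for illustration and are not taken from our code, and the exact wording of the message depends on the device and PyTorch version.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the Conv1d inside our encoder; all sizes are arbitrary.
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3).to(torch.bfloat16)

# The batch stays in float32, as in our validation_step.
x = torch.randn(2, 4, 16)


def try_forward(module, inp):
    """Run the module and return the error message, or None if it succeeds."""
    try:
        module(inp)
    except RuntimeError as err:
        return str(err)
    return None


print("input dtype: ", x.dtype)
print("weight dtype:", conv.weight.dtype)
print("error:", try_forward(conv, x))
```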

DeepSpeed configuration:
```
deepspeed_configs = {
    "zero_allow_untested_optimizer": True,
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "allgather_bucket_size": 500000000,
        "reduce_bucket_size": 500000000
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True
    }
}
```

**Expected behavior**
When we train using DDP, we don't get this error. As soon as we switch to `deepspeed==0.15.2`, the forward pass fails with the input type / weight type mismatch shown above.
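
A workaround we are considering is sketched below. It assumes the mismatch arises because the bf16 mode leaves the module weights in bfloat16 while our batches stay in float32; we have not confirmed that this is the actual cause. The helper name `cast_floats_like` is hypothetical. It casts only floating-point tensors in a (possibly nested) batch to the dtype of the module's first parameter and leaves integer tensors such as indices untouched.

```python
import torch
import torch.nn as nn


def cast_floats_like(module, batch):
    """Recursively cast floating-point tensors in `batch` to the dtype of
    the first parameter of `module`. Non-float tensors are returned as-is."""
    target = next(module.parameters()).dtype
    if torch.is_tensor(batch):
        return batch.to(target) if batch.is_floating_point() else batch
    if isinstance(batch, dict):
        return {k: cast_floats_like(module, v) for k, v in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(cast_floats_like(module, v) for v in batch)
    return batch


# Illustrative usage with a stand-in encoder (not our real model).
encoder = nn.Linear(4, 2).to(torch.bfloat16)
batch = {"feat": torch.randn(3, 4), "index": torch.tensor([0, 1, 2])}
casted = cast_floats_like(encoder, batch)
print(casted["feat"].dtype, casted["index"].dtype)
```

In our `forward`, this would mean calling something like `encoder(cast_floats_like(encoder, batch[channel_name]))`, at the cost of running validation inputs in reduced precision.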

**ds_report output**


**System info (please complete the following information):**
 - OS: Ubuntu 22.04
 - GPU count and types: 16x H100
 - Python version: 3.10.1

**Launcher context**
Ray Lightning, Ray VM Launcher



[BUG] RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same #6789

rileyhun opened on Nov 26, 2024
