[BUG] RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same #6789

Describe the bug

We are using DeepSpeed to pre-train a model, but we hit the following error during the validation step:

 File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/model.py", line 246, in validation_step
    embeddings = self.forward(batch)
  File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/model.py", line 219, in forward
    embeddings[channel_name] = self.normalize(encoder(batch[channel_name]))
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/encoders/uvnet.py", line 465, in forward
    hidden_crv_feat = self.curv_encoder(input_crv_feat)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ray/session_2024-11-26_00-04-12_236521_437/runtime_resources/working_dir_files/_ray_pkg_8abdeac0a60ab9ea/****/model/encoders/uvnet.py", line 153, in forward
    x = self.conv1(x)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same
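
The failing op is the first Conv1d in the curve encoder. The same RuntimeError can be reproduced outside DeepSpeed with a minimal sketch, assuming the weights have been cast to bf16 while the input stays fp32 (the channel sizes here are made up):

    import torch
    import torch.nn as nn

    # Weights in bf16 (as DeepSpeed's bf16 engine casts the module), input left in fp32:
    conv = nn.Conv1d(6, 64, kernel_size=3, padding=1).cuda().to(torch.bfloat16)
    x = torch.randn(2, 6, 32, device="cuda")  # a torch.cuda.FloatTensor
    conv(x)  # RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same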

To Reproduce
Proprietary code, so we cannot share a full reproduction.

DeepSpeed configuration:

deepspeed_configs = {
    "zero_allow_untested_optimizer": True,
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "allgather_bucket_size": 500000000,
        "reduce_bucket_size": 500000000
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True
    }
}
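
For context: once "bf16": {"enabled": "auto"} resolves to true, DeepSpeed casts the module parameters to bf16, while the tensors our dataloader produces stay fp32, which matches the mismatch in the trace. A workaround sketch we can apply in validation_step before the forward call (the dict-of-tensors batch layout comes from our forward() above; param_dtype is a name introduced here):

    # Sketch: cast floating-point inputs to the parameter dtype before the forward.
    param_dtype = next(self.parameters()).dtype  # torch.bfloat16 once DeepSpeed wraps the module
    batch = {
        k: (v.to(param_dtype) if torch.is_tensor(v) and v.is_floating_point() else v)
        for k, v in batch.items()
    }
    embeddings = self.forward(batch)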

Expected behavior
When we train with DDP, we don't get this error; but as soon as we switch to deepspeed==0.15.2, the forward pass fails with the input/weight dtype mismatch shown above.
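
A quick probe (sketch; model and batch stand in for our LightningModule and one validation batch) shows where the dtypes diverge between the two launchers:

    # Parameter dtype flips between launchers; the input dtype does not.
    print(next(model.parameters()).dtype)          # torch.float32 under DDP, torch.bfloat16 under DeepSpeed
    print({k: v.dtype for k, v in batch.items()})  # torch.float32 either way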

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: 16x H100
  • Python version: 3.10.1

Launcher context
Ray Lightning, Ray VM Launcher
