Description
Found on these code versions: 5258501, huggingface/accelerate@12a007d, pytorch/pytorch@3477ee3. This is an issue with XPU support in stock PyTorch (i.e. without using IPEX).
HF model pipelines with `device_map="auto"` (or `device_map="sequential"`) do not actually run on XPU even if the model fits in device memory. I spotted this trying to run LLAMA 3 models:
- https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
- https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
Example script:
```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])
```
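To see where the weights actually landed, you can inspect the device map that accelerate attaches to the loaded model (a small diagnostic sketch; it assumes a stock PyTorch build with XPU support, matching the versions above):

```python
# Diagnostic: confirm XPU is visible to stock PyTorch and check where
# accelerate actually placed the model's modules. With this bug the map
# shows CPU/disk targets even though the XPU device is available.
print(torch.xpu.is_available())      # True on a stock XPU-enabled build
print(torch.xpu.device_count())      # number of visible XPU devices
print(pipeline.model.hf_device_map)  # per-module placement chosen by accelerate
```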
Workarounds and findings:
- If the model fits device memory, then changing `device_map="auto"` to `device_map="xpu"` will allow the model to run (that's easier to check on the 8B model); see the sketch after the traceback below.
- The model also starts to work (but see the note below) if you add `max_memory` to the model kwargs:

  ```python
  model_kwargs={"torch_dtype": torch.bfloat16, "max_memory": {0: 5.0e+10}},
  device_map="auto",
  ```
- NOTE: adding `max_memory` will currently work only if the model fits into device memory and you provide a big enough `max_memory` limit. If not, you will see the following error (filed #31941 separately for this: cuda device is wrongly requested instead of xpu when running `pipeline(device_map="auto", max_memory={0: 1.0e+10})`):
```
...
  File "/home/gta/git/huggingface/accelerate/src/accelerate/utils/offload.py", line 118, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File "/home/gta/git/huggingface/accelerate/src/accelerate/utils/offload.py", line 171, in __getitem__
    tensor = f.get_tensor(weight_info.get("weight_name", key))
  File "/home/gta/git/pytorch/pytorch/torch/cuda/__init__.py", line 305, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
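For reference, a minimal sketch of the first workaround (same script as above, only the `device_map` value changes; it assumes the whole model fits in XPU device memory):

```python
# Workaround: bypass accelerate's automatic placement and pin the whole
# pipeline to the XPU device (only viable when the model fits in memory).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="xpu",
)
```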
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 @sywangyi @yao-matrix