Description
Found on these code versions: 5258501, huggingface/accelerate@12a007d, pytorch/pytorch@3477ee3. This is an issue with XPU support in stock PyTorch (i.e. without using IPEX).
HF model pipelines with `device_map="auto"` (or `device_map="sequential"`) do not actually run on XPU even if the model fits in device memory. I spotted this trying to run LLAMA 3 models:
- https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
- https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
Example script:
```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])
```
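To see where the weights actually landed, you can inspect the device map that accelerate attaches to the loaded model (a small diagnostic sketch; it assumes a stock PyTorch build with XPU support, matching the versions above):

```python
# Diagnostic: confirm XPU is visible to stock PyTorch and check where
# accelerate actually placed the model's modules. With this bug the map
# shows CPU/disk targets even though the XPU device is available.
print(torch.xpu.is_available())      # True on a stock XPU-enabled build
print(torch.xpu.device_count())      # number of visible XPU devices
print(pipeline.model.hf_device_map)  # per-module placement chosen by accelerate
```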
Workarounds and findings:
- If the model fits device memory, then changing `device_map="auto"` to `device_map="xpu"` will allow the model to run (that's easier to check on the 8B model); see the sketch after the traceback below.
- The model also starts to work (but see the note below) if you add `max_memory` to the model kwargs:

  ```python
  model_kwargs={"torch_dtype": torch.bfloat16, "max_memory": {0: 5.0e+10}},
  device_map="auto",
  ```
- NOTE: adding `max_memory` will currently work only if the model fits into device memory and you provide a big enough `max_memory` limit. If not, you will see the following error (filed #31941 separately for this: cuda device is wrongly requested instead of xpu when running `pipeline(device_map="auto", max_memory={0: 1.0e+10})`):
```
...
  File "/home/gta/git/huggingface/accelerate/src/accelerate/utils/offload.py", line 118, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File "/home/gta/git/huggingface/accelerate/src/accelerate/utils/offload.py", line 171, in __getitem__
    tensor = f.get_tensor(weight_info.get("weight_name", key))
  File "/home/gta/git/pytorch/pytorch/torch/cuda/__init__.py", line 305, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
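For reference, a minimal sketch of the first workaround (same script as above, only the `device_map` value changes; it assumes the whole model fits in XPU device memory):

```python
# Workaround: bypass accelerate's automatic placement and pin the whole
# pipeline to the XPU device (only viable when the model fits in memory).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="xpu",
)
```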
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 @sywangyi @yao-matrix