[Model] Meta Llama 3.1 Known Issues & FAQ #6689
Please check out Announcing Llama 3.1 Support in vLLM.
Comments
Just me or are other people also having issues running llama 3.1 models? My error:
if rope_scaling is not None and rope_scaling["type"] not in
KeyError: 'type'
I think there's some issue with parsing the config for
Version:
I'm having the same issue as @Vaibhav-Sahai. Command:
Error trace:
ValueError: 'rope_scaling' must be a dictionary with two fields, 'type' and 'factor', got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
Version:
[RESOLVED]
Hello there! @romilbhardwaj @arkilpatel Thanks for reporting the issue; we are aware of it. This is due to the fact that HuggingFace decided to rename this key ("rope_type" instead of "type") in the repos of all Llama 3.1 models.
Please update vLLM to version v0.5.3.post1.
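For context, a minimal sketch (not the vLLM fix itself, and the config.json path is an assumption) of what normalizing the renamed key would look like for an older parser that still expects "type":

```python
# Sketch only: copy the renamed "rope_type" key back to "type" in a locally
# downloaded Llama 3.1 config.json, for code paths that expect the old name.
import json

with open("config.json") as f:  # hypothetical local path
    config = json.load(f)

rope_scaling = config.get("rope_scaling")
if rope_scaling is not None and "type" not in rope_scaling and "rope_type" in rope_scaling:
    rope_scaling["type"] = rope_scaling["rope_type"]  # e.g. "llama3"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```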
@simon-mo maybe I missed something, but getting the following log:
LLM config:
Version:
EDIT: works when trying
EDIT2: updating to 0.5.4 makes this work without any additional flags. Thank you to all collaborators!
@Vaibhav-Sahai updated my hack, PTAL.
@Vaibhav-Sahai @romilbhardwaj @casper-hansen We fixed the RoPE issue in #6693. The model should work without any extra args now on the main branch.
Thanks @WoosukKwon. Are you guys planning any post release today, or should we build from source until 0.5.4?
We plan to release ASAP after confirmation with the HuggingFace team.
Llama 3.1 405B base in FP8 is just generating !!!! over and over for me; is anyone else having this issue? I verified that 70B base works for me (but it's not in FP8).
Regarding the RoPE issue, the new version has been released with the fix. Please test it out!
Hey, is this the same issue?
I am using the following configuration
As suggested, used
Please try updating the vLLM version; you don't need the rope scaling anymore.
@simon-mo, same issue as @akhil-netomi for me. Just updated to post1, and I unfortunately get the same issue. I'm trying the 70B instruct variant; I tried with and without the hotfix. Error is in the
This might be where the issue lies, although I'm not familiar with the codebase.
Hey! I updated to the latest vLLM version and Llama3.1-70B works |
@sumukshashidhar are you trying to use chameleon? It would be great if you could paste the whole stack trace so we can see if the error is coming from vLLM or transformers.
405B instruct FP8 works for me. It's just the base model that is not working. Also, base and instruct seem to use different amounts of GPU memory, which should not happen (base uses less).
@sumukshashidhar @akhil-netomi Please try upgrading transformers (pip install transformers --upgrade).
For those that want to test vLLM with gptq_marlin compatible 4bit quants, we have just pushed both 8B models to HF: https://huggingface.co/ModelCloud/Meta-Llama-3.1-8B-Instruct-gptq-4bit
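As a rough illustration (not taken from the thread), loading one of those quants with vLLM's offline API might look like the sketch below; the quantization argument is usually auto-detected from the repo's quantize config and is shown only for clarity:

```python
# Sketch: load the 4-bit GPTQ quant and run a quick generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ModelCloud/Meta-Llama-3.1-8B-Instruct-gptq-4bit",
    quantization="gptq_marlin",  # typically auto-detected; explicit here for clarity
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```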
@ywang96 my bad, I'm not trying to use chameleon; my error is in the normal transformers module. @WoosukKwon yes, upgrading transformers fixes it. Thanks!
Do you run 405B with CUDA 11.8? The CUDA version should be 12.x to support FP8.
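A quick, generic way to see what your environment reports (a sketch, not specific to this thread):

```python
# Sketch: print the CUDA toolkit version PyTorch was built with and the GPU
# compute capability, since FP8 support expects CUDA 12.x and recent hardware.
import torch

print("torch built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU 0 compute capability:", torch.cuda.get_device_capability(0))
```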
@pseudotensor
I keep getting the following error when launching Llama 3.1 70B/70B-Instruct:
I'm launching the model using 4 A40 GPUs, and the same hardware has no issues hosting Llama 3 70B. I first noticed this error with
@XkunW Llama 3.1 has a longer context length (128k by default), so this is expected.
So I should set the context length to a number small enough that it is always smaller than the KV cache size?
Yes, as the error suggests, you can add
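The flag being referred to is presumably a cap on the context window such as --max-model-len; a minimal offline sketch under that assumption (model name and sizes are illustrative):

```python
# Sketch: cap the context window so the KV cache fits on 4 GPUs,
# instead of Llama 3.1's 128k default.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192,  # same effect as passing --max-model-len 8192 to the server
)
```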
ERROR: Exception in ASGI application
During handling of the above exception, another exception occurred: Traceback (most recent call last):
I got this error. Command:
Running on A800 * 8 * 80G (DGX), this error occurred when I use Llama-3.1 models, 70B or 405B-FP8. Llama-3 70B has no problem!
@youkaichao I am having an issue when trying to run Llama 3.1 405B on 8 nodes, each with 4 64GB A100 GPUs; you can see the specifics here: https://leonardo-supercomputer.cineca.eu/hpc-system/ Since pipeline parallelism is not supported with LLMEngine, I am using
This is the full stack trace; it appears to hang when creating CUDA graphs, though this might not be the case. Any chance you could help?
Hi, I am facing an issue with serving a fine-tuned version of Llama 3.1. The issue is not with the deployment itself, as the deployment works fine with this command: sudo docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=abc" -p 8000:8000 --ipc=host docker.io/vllm/vllm-openai:latest --model shress/llama3.1-16bit --max_model_len=8000
But when I send requests to the endpoint, I get very random responses. I looked at the logs on the server and noticed that the applied chat_template doesn't apply the tokens correctly based on the roles. Do we need to manually create a chat template and apply it at deployment? Or is there some Llama 3.1 chat template offered by vLLM that can automatically apply the necessary tokens for each role?
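For what it's worth, the chat template bundled with the model repo is applied server-side when you call the OpenAI-compatible chat endpoint rather than the plain completions endpoint. A sketch against the docker deployment above, assuming it is reachable on localhost:8000:

```python
# Sketch: use /v1/chat/completions so the model's own chat template formats the roles.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="shress/llama3.1-16bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."},
    ],
)
print(response.choices[0].message.content)
```

If the repo's template needs to be overridden, the server also accepts a --chat-template argument pointing at a Jinja template file.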
@youkaichao Thanks! Indeed that fixes it; I was able to generate text on 8 nodes and 32 GPUs! Also, if you know, is pipeline parallel going to be supported for the
You should use pipeline parallel with the API server directly, via
Unfortunately I can't easily do that, because the nodes where I can run this have no access to the internet or any outside connection, so I would need to do some strange things to log in to the nodes myself or something like that. Thank you very much!
For those who are having problems launching meta-llama/Meta-Llama-3.1-405B-Instruct-FP8: please try --enable-chunked-prefill=false, then optionally combine it with --max-model-len=4096 if turning it off causes OOM. You can change the length to whatever context window you desire.
UPDATE: the meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 model repository has been fixed with the correct number of KV heads. Please try launching with default vLLM args and the updated model weights!
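Translated to the offline API, that earlier (pre-fix) workaround would look roughly like this sketch; the tensor parallel size is illustrative, not prescribed by the thread:

```python
# Sketch of the pre-fix workaround: disable chunked prefill and shrink the context.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,        # illustrative; size to your hardware
    enable_chunked_prefill=False,  # equivalent to --enable-chunked-prefill=false
    max_model_len=4096,            # equivalent to --max-model-len=4096
)
```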
Llama 3.1 405B crashes when running 8 requests. I'm running with TP=64. Very frustrating, as it took 3 hours to load the model from S3.
I am using the vLLM version
I am trying to load the LLaMA 3.1 70B model on two different servers. The configuration of the servers is as follows:
When I load the model on Server-1 using the following command:
The 4 GPUs on Server-1 are each allocated 11GB of memory. Next, I attempted to distribute the model across Server-1 and Server-2 by connecting them via a Ray cluster. I used the following command:
After this, I observed that:
I expected that using more GPUs across both servers would reduce memory usage per GPU. However, the allocation remains at 11GB per GPU on both servers. Can anyone clarify why this is happening, and why adding more GPUs doesn't seem to reduce the memory usage per GPU?
Like @timxzz, I too was getting the error about there not being enough space for the KV cache when running
@timxzz Hopefully setting
Wow, hours spent and I finally found this solution!
Amazing! This works! Thanks!