[Model] Meta Llama 3.1 Known Issues & FAQ #6689

Closed
simon-mo opened this issue Jul 23, 2024 · 85 comments

Comments

@simon-mo
Collaborator

simon-mo commented Jul 23, 2024

Please check out Announcing Llama 3.1 Support in vLLM

  • Chunked prefill is turned on for all Llama 3.1 models. However, it is currently incompatible with prefix caching, sliding window, and multi-LoRA. To use those features, set --enable-chunked-prefill=false, optionally combined with --max-model-len=4096 if turning chunked prefill off causes OOM; adjust that value to the context length you need (see the example after this list).
  • RoPE scaling error: if rope_scaling is not None and rope_scaling["type"] not in ..., KeyError: 'type'
  • RoPE scaling error: ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
    • Please upgrade transformers to 4.43.1 (pip install transformers --upgrade)
  • Using a per-request random seed currently does not work with pipeline parallel deployments ([Bug]: Seed issue with Pipeline Parallel #6449). This will be fixed soon.
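
For example, a minimal Python-API launch along these lines (a sketch only; the model id, GPU count, and prompt below are placeholders, not part of this issue):

from vllm import LLM, SamplingParams

# Disable chunked prefill so that prefix caching / multi-LoRA can be used,
# and cap the context window at 4096 tokens in case disabling it causes OOM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # placeholder model id
    tensor_parallel_size=2,                          # adjust to your GPU count
    enable_chunked_prefill=False,
    enable_prefix_caching=True,
    max_model_len=4096,                              # raise or lower to the context length you need
)
outputs = llm.generate(["Hello, Llama 3.1!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)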

UPDATE: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 model repository has been fixed with the correct number of kv heads. Please try launching with default vLLM args and the updated model weights!

@simon-mo simon-mo added misc and removed misc labels Jul 23, 2024
@simon-mo simon-mo pinned this issue Jul 23, 2024
@Vaibhav-Sahai

Just me or are other people also having issues running llama 3.1 models? My error:
if rope_scaling is not None and rope_scaling["type"] not in, KeyError: 'type'.
Config:

llm = LLM(model=MODEL,
          tensor_parallel_size=NUM_GPUS,
          enable_prefix_caching=False,
          gpu_memory_utilization=0.80,
          max_model_len=4096,
          trust_remote_code=True,
          max_num_seqs=16,
          )

@romilbhardwaj

romilbhardwaj commented Jul 23, 2024

I think there's some issue with parsing the config for meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 fetched from huggingface:

$ vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8
INFO 07-23 16:04:42 api_server.py:219] vLLM API server version 0.5.3
INFO 07-23 16:04:42 api_server.py:220] args: Namespace(model_tag='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', host=None, port=8000, uvicorn_log_level='info', allow_credential
s=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fee7856c5e0>)
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22.5k/22.5k [00:00<00:00, 85.5MB/s]
Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 148, in main
    args.dispatch_function(args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 28, in serve
    run_server(args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
    if llm_engine is not None else AsyncLLMEngine.from_engine_args(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
    model_config = ModelConfig(
  File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 1497, in _get_and_verify_max_len
    if rope_scaling is not None and rope_scaling["type"] not in {
KeyError: 'type'

Version:

pip freeze | grep vllm
vllm==0.5.3
vllm-flash-attn==2.5.9.post1

@arkilpatel

arkilpatel commented Jul 23, 2024

I'm having the same issue as @Vaibhav-Sahai

Command:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 2

Error trace:

File "envs/synthenv/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
  self.max_model_len = _get_and_verify_max_len(
File "envs/synthenv/lib/python3.10/site-packages/vllm/config.py", line 1497, in _get_and_verify_max_len
  if rope_scaling is not None and rope_scaling["type"] not in {
KeyError: 'type'

Version:

vllm==0.5.3
vllm-flash-attn==2.5.9.post1

[RESOLVED]:
Adding this to the CLI works: --rope-scaling='{"type": "extended", "factor": 8.0}'
Thanks @simon-mo!

@ywang96
Member

ywang96 commented Jul 23, 2024

Hello there! @romilbhardwaj @arkilpatel Thanks for reporting the issue; we are aware of it. This is because Hugging Face decided to rename this key ("rope_type" instead of "type") in the repos of all Llama 3.1 models.

@simon-mo
Collaborator Author

simon-mo commented Jul 23, 2024

In the CLI or Python, passing --rope-scaling='{"type": "extended", "factor": 8.0}' or rope_scaling={"type": "extended", "factor": 8.0} should get around this for now.

Please update vLLM version to v0.5.3.post1.

@Vaibhav-Sahai

Vaibhav-Sahai commented Jul 23, 2024

@simon-mo maybe I missed something, but I'm getting the following log:

assert "factor" in rope_scaling
AssertionError

LLM config:

llm = LLM(model=MODEL,
          tensor_parallel_size=NUM_GPUS,
          enable_prefix_caching=False,
          gpu_memory_utilization=0.90,
          max_model_len=4096,
          trust_remote_code=True,
          max_num_seqs=16,
          rope_scaling={"type": "dummy"},
          )

Version:

vllm==0.5.3

EDIT: works when trying rope_scaling={"type": "extended", "factor": 8.0}. Thanks @simon-mo!

EDIT2: updating to 0.5.4 makes this work without any additional flags. Thank you to all collaborators!

@casper-hansen
Contributor

casper-hansen commented Jul 23, 2024

@ywang96 renaming rope_type to type does not work either. I just wanted to run some benchmarks of the AWQ models, but I cannot seem to get them running at the moment due to the mismatch between the expected and actual RoPE parameters in the config.

Any specific solution for this? EDIT: Nevermind, I see #6693

@simon-mo
Collaborator Author

@Vaibhav-Sahai updated my hack, PTAL.

@WoosukKwon
Collaborator

@Vaibhav-Sahai @romilbhardwaj @casper-hansen We fixed the RoPE issue in #6693. The model should work without any extra args now in the main branch.

@casper-hansen
Contributor

Thanks @WoosukKwon. Are you guys planning a post release today, or should we build from source until 0.5.4?

@simon-mo
Collaborator Author

We plan to release ASAP after confirmation with the HuggingFace team.

@crowsonkb

crowsonkb commented Jul 23, 2024

Llama 3.1 405B base in FP8 is just generating !!!! over and over for me; is anyone else having this issue? I verified that 70B base works for me (but it's not in FP8).

@simon-mo
Collaborator Author

Regarding the rope issue, the new version has been released with the fix. Please test it out!

@akhil-netomi

Hey is this the same issue?

ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

I am using the following configuration

Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'type': 'extended', 'factor': 8.0}, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)

As suggested, I used rope_scaling={"type": "extended", "factor": 8.0} with the vLLM OpenAI server.

@simon-mo
Collaborator Author

Please try updating the vLLM version; you won't need the rope scaling override anymore.

@sumukshashidhar

sumukshashidhar commented Jul 23, 2024

@simon-mo, same issue as @akhil-netomi for me. I just updated to post1, but unfortunately I still get the same issue.

I'm trying the 70B Instruct variant - I tried with and without the hotfix. The error is in the _rope_scaling_validation method.

@sumukshashidhar

if self.rope_scaling is None:

This might be where the issue lies - although I'm not familiar with the codebase

@ayushchakravarthy

Hey!
I'm trying to get Llama 3.1 405B FP8 working with vLLM and getting this error:
RuntimeError: "fill_empty_deterministic_" not implemented for 'Float8_e4m3fn'

I updated to the latest vLLM version, and Llama 3.1 70B works.

@ywang96
Member

ywang96 commented Jul 23, 2024

@sumukshashidhar are you trying to use chameleon? That class won't be touched unless you're trying to serve ChameleonForConditionalGeneration.

It would be great if you could paste the whole stack trace so we can see whether the error is coming from vLLM or transformers.

@crowsonkb

Llama 3.1 405B base in FP8 is just generating !!!! over and over for me; is anyone else having this issue? I verified that 70B base works for me (but it's not in FP8).

405B instruct FP8 works for me. It's just the base model that is not working. Also, base and instruct seem to use different amounts of GPU memory, which should not happen (base uses less).

@WoosukKwon
Collaborator

WoosukKwon commented Jul 23, 2024

@sumukshashidhar @akhil-netomi Please try upgrading transformers with pip install --upgrade transformers
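
If helpful, a quick check of the installed version (a minimal sketch; the 4.43.1 floor comes from the top post, and packaging is assumed to be available since it ships with transformers' dependencies):

import transformers
from packaging import version

# Llama 3.1 configs use the new rope_scaling format ("rope_type": "llama3"),
# which older transformers releases reject with the errors quoted above.
assert version.parse(transformers.__version__) >= version.parse("4.43.1"), transformers.__version__
print("transformers", transformers.__version__, "is new enough for Llama 3.1")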

@Qubitium
Contributor

Qubitium commented Jul 23, 2024

@sumukshashidhar

@ywang96 my bad, I'm not trying to use chameleon, my error is in the normal transformers module. @WoosukKwon yes, upgrading transformers fixes it. Thanks!

@DriverSong
Contributor

Got this:

UNAVAILABLE: Internal: RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request .

I'm using vllm 0.5.3-post1 and Meta-Llama-3.1-405B-Instruct-FP8

Do you run 405B with CUDA 11.8? The CUDA version should be 12.x to support FP8.
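
A quick way to check which CUDA build your PyTorch installation was compiled against (a minimal sketch; the 12.x requirement is as stated above):

import torch

# vLLM's FP8 (Float8_e4m3fn) path needs a CUDA 12.x build of PyTorch, per the comment above.
print("torch:", torch.__version__)
print("compiled against CUDA:", torch.version.cuda)  # e.g. '12.1'; '11.8' would explain the error above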

@linpan

linpan commented Jul 31, 2024

gptq_marlin not found.

@kathanpatel52

QQ, what is good choice for max-model-len for 8*H100 for 405B?

Should I just find out? If anyone has suggestions welcome. That is, 128k is default, I presume.

But should I go for 16k, 32k, 64k, 128k? What can 8*H100 support at full context length?

@pseudotensor
Did you get an answer for this?
Did you test with 64K or more context length on 8*H100?

@XkunW

XkunW commented Aug 6, 2024

I keep getting the following error when launching Llama 3.1 70B/70B-Instruct:

(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method initialize_cache: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (106384). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine., Traceback (most recent call last):
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]     raise_if_cache_size_invalid(num_gpu_blocks,
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 374, in raise_if_cache_size_invalid
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]     raise ValueError(
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226] ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (106384). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
(VllmWorkerProcess pid=1118) ERROR 08-06 14:58:39 multiproc_worker_utils.py:226]

I'm launching the model using 4 A40 GPUs, and the same hardware has no issues hosting Llama 3 70B. I first noticed this error with vLLM v0.5.3.post1, and it persists with v0.5.4. Every time I try to launch one of these models the reported KV cache size is different, so pinning max_model_len to a smaller number wouldn't work reliably. I've already increased gpu_memory_utilization to 0.95, and setting this parameter to 1 seems to cause CUDA out-of-memory errors.

@youkaichao
Member

@XkunW Llama 3.1 has a longer context length (128k by default), so this is expected.

@XkunW

XkunW commented Aug 6, 2024

@XkunW Llama 3.1 has a longer context length (128k by default), so this is expected.

So I should set the context length to a number small enough that it is always smaller than the KV cache size?

@youkaichao
Member

Yes, as the error suggests, you can add --max-model-len 106000.
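
Equivalently, with the Python API (a minimal sketch; the model id and GPU count are placeholders):

from vllm import LLM

# Keep the context window at or below the KV-cache capacity reported in the error,
# and/or raise gpu_memory_utilization.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # placeholder model id
    tensor_parallel_size=4,                          # e.g. the 4x A40 setup described above
    max_model_len=106000,                            # must not exceed the KV-cache token budget
    gpu_memory_utilization=0.95,
)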

@mangomatrix

mangomatrix commented Aug 7, 2024

ERROR: Exception in ASGI application

  • Exception Group Traceback (most recent call last):
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_utils.py", line 87, in collapse_excgroups
    | yield
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 190, in call
    | async with anyio.create_task_group() as task_group:
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 680, in aexit
    | raise BaseExceptionGroup(
    | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    | result = await app( # type: ignore[func-returns-value]
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
    | return await self.app(scope, receive, send)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in call
    | await super().call(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/applications.py", line 123, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in call
    | raise exc
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in call
    | await self.app(scope, receive, _send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 189, in call
    | with collapse_excgroups():
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/contextlib.py", line 158, in exit
    | self.gen.throw(value)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_utils.py", line 93, in collapse_excgroups
    | raise exc
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 191, in call
    | response = await self.dispatch_func(request, call_next)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 260, in authentication
    | return await call_next(request)
    | ^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 165, in call_next
    | raise app_exc
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 151, in coro
    | await self.app(scope, receive_or_disconnect, send_no_error)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in call
    | await self.app(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 65, in call
    | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    | raise exc
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    | await app(scope, receive, sender)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 756, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 776, in app
    | await route.handle(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 297, in handle
    | await self.app(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 77, in app
    | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    | raise exc
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    | await app(scope, receive, sender)
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 72, in app
    | response = await func(request)
    | ^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/fastapi/routing.py", line 278, in app
    | raw_response = await run_endpoint_function(
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    | return await dependant.call(**values)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    | generator = await openai_serving_chat.create_chat_completion(
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    | return await self.chat_completion_full_generator(
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    | async for res in result_generator:
    | File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    | raise request_output
    | vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
    +------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 189, in call
with collapse_excgroups():
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/contextlib.py", line 158, in exit
self.gen.throw(value)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_utils.py", line 93, in collapse_excgroups
raise exc
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 191, in call
response = await self.dispatch_func(request, call_next)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 260, in authentication
return await call_next(request)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 165, in call_next
raise app_exc
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/base.py", line 151, in coro
await self.app(scope, receive_or_disconnect, send_no_error)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
return await self.chat_completion_full_generator(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
async for res in result_generator:
File "/home/datagrand/miniconda3/envs/vllm_new_hw/lib/python3.12/site-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
raise request_output
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

I got this error,
env:
vllm 0.5.4+cu118

command:
vllm serve /mnt/models/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --api-key token-abc123 --host 0.0.0.0 --port 88

on 8x A800 80GB (DGX)

This error occurred when I use Llama 3.1 models (70B or 405B-FP8); Llama 3 70B has no problem!

@gpucce

gpucce commented Aug 7, 2024

@youkaichao I am having an issue when trying to run Llama 3.1 405B on 8 nodes, each with 4x 64GB A100 GPUs; you can see the specifics here: https://leonardo-supercomputer.cineca.eu/hpc-system/

Since pipeline parallelism is not supported with LLMEngine, I am using tensor_parallel_size=32
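
For reference, the engine construction boils down to roughly this (a simplified sketch reconstructed from the config line in the log below, not the exact launch script):

from vllm import LLM

# 8 nodes x 4 A100s = tensor_parallel_size 32, spanned via the Ray cluster started in the log below.
llm = LLM(
    model="/leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf",
    tokenizer_mode="slow",
    tensor_parallel_size=32,
    max_model_len=8192,
    distributed_executor_backend="ray",  # vLLM defaults to Ray here anyway for multi-node TP
)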

This is the full stack trace. It appears to hang when creating CUDA graphs, though this might not be the case; any chance you could help?

/var/spool/slurmd/job6730439/slurm_script: line 14: /leonardo_scratch/large/userexternal/gpuccett/Repos/llm_newest/conda_venv/bin/conda: No such file or directory
IP Head: lrdn1699:6379
STARTING HEAD at lrdn1699
STARTING WORKER 1 at lrdn1739
IP Head: lrdn1699:6379
Node: lrdn1739 IP Local: 10.5.1.43
STARTING WORKER 2 at lrdn1760
IP Head: lrdn1699:6379
Node: lrdn1760 IP Local: 10.5.1.64
STARTING WORKER 3 at lrdn1766
IP Head: lrdn1699:6379
Node: lrdn1766 IP Local: 10.5.1.70
STARTING WORKER 4 at lrdn1778
IP Head: lrdn1699:6379
Node: lrdn1778 IP Local: 10.5.1.82
STARTING WORKER 5 at lrdn1788
IP Head: lrdn1699:6379
Node: lrdn1788 IP Local: 10.5.1.92
STARTING WORKER 6 at lrdn1813
IP Head: lrdn1699:6379
Node: lrdn1813 IP Local: 10.6.0.13
STARTING WORKER 7 at lrdn1855
IP Head: lrdn1699:6379
Node: lrdn1855 IP Local: 10.6.0.55
STARTING python command
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,201 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-08-07 10:14:26,200 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,200 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,200 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,200 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,200 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,201 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:26,200 - INFO - NumExpr defaulting to 8 threads.
2024-08-07 10:14:30,579	INFO usage_lib.py:443 -- Usage stats collection is disabled.
2024-08-07 10:14:30,579	INFO scripts.py:763 -- Local node IP: lrdn1699
2024-08-07 10:14:30,582	INFO scripts.py:945 -- Local node IP: 10.5.1.70
2024-08-07 10:14:30,585	INFO scripts.py:945 -- Local node IP: 10.6.0.13
2024-08-07 10:14:30,585	INFO scripts.py:945 -- Local node IP: 10.6.0.55
2024-08-07 10:14:30,586	INFO scripts.py:945 -- Local node IP: 10.5.1.43
2024-08-07 10:14:30,585	INFO scripts.py:945 -- Local node IP: 10.5.1.64
2024-08-07 10:14:30,586	INFO scripts.py:945 -- Local node IP: 10.5.1.92
2024-08-07 10:14:30,586	INFO scripts.py:945 -- Local node IP: 10.5.1.82
2024-08-07 10:14:35,102	SUCC scripts.py:800 -- --------------------
2024-08-07 10:14:35,102	SUCC scripts.py:801 -- Ray runtime started.
2024-08-07 10:14:35,102	SUCC scripts.py:802 -- --------------------
2024-08-07 10:14:35,102	INFO scripts.py:804 -- Next steps
2024-08-07 10:14:35,102	INFO scripts.py:807 -- To add another node to this Ray cluster, run
2024-08-07 10:14:35,102	INFO scripts.py:810 --   ray start --address='lrdn1699:6379'
2024-08-07 10:14:35,102	INFO scripts.py:819 -- To connect to this Ray cluster:
2024-08-07 10:14:35,102	INFO scripts.py:821 -- import ray
2024-08-07 10:14:35,102	INFO scripts.py:822 -- ray.init(_node_ip_address='lrdn1699')
2024-08-07 10:14:35,102	INFO scripts.py:853 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,102	INFO scripts.py:854 --   ray stop
2024-08-07 10:14:35,102	INFO scripts.py:857 -- To view the status of the cluster, use
2024-08-07 10:14:35,102	INFO scripts.py:858 --   ray status
2024-08-07 10:14:35,103	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,103	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,103	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
[2024-08-07 10:14:35,261 I 215234 215234] global_state_accessor.cc:432: This node has an IP address of 10.6.0.13, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2024-08-07 10:14:35,261 I 699603 699603] global_state_accessor.cc:432: This node has an IP address of 10.5.1.82, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2024-08-07 10:14:35,261 I 668295 668295] global_state_accessor.cc:432: This node has an IP address of 10.5.1.92, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2024-08-07 10:14:35,261 I 727164 727164] global_state_accessor.cc:432: This node has an IP address of 10.6.0.55, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2024-08-07 10:14:35,261 I 699356 699356] global_state_accessor.cc:432: This node has an IP address of 10.5.1.64, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2024-08-07 10:14:35,261 I 684730 684730] global_state_accessor.cc:432: This node has an IP address of 10.5.1.43, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2024-08-07 10:14:35,261 I 664849 664849] global_state_accessor.cc:432: This node has an IP address of 10.5.1.70, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
2024-08-07 10:14:35,296	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,296	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,296	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,296	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,296	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,296	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,297	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,297	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,297	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,297	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,297	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,297	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,297	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,297	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,297	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-08-07 10:14:35,297	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-08-07 10:14:35,297	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,297	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,297	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,297	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,297	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,297	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,297	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,297	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,297	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,297	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,297	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,297	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,297	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,297	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,297	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,297	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,297	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,297	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-08-07 10:14:35,297	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,297	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,297	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-08-07 10:14:35,297	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,297	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,297	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-08-07 10:14:35,298	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,298	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,298	SUCC scripts.py:958 -- --------------------
2024-08-07 10:14:35,298	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,298	SUCC scripts.py:959 -- Ray runtime started.
2024-08-07 10:14:35,298	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,298	SUCC scripts.py:960 -- --------------------
2024-08-07 10:14:35,298	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,298	INFO scripts.py:962 -- To terminate the Ray runtime, run
2024-08-07 10:14:35,298	INFO scripts.py:963 --   ray stop
2024-08-07 10:14:35,298	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,298	INFO scripts.py:971 -- --block
2024-08-07 10:14:35,298	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,298	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-08-07 10:14:35,298	INFO scripts.py:972 -- This command will now block forever until terminated by a signal.
2024-08-07 10:14:35,298	INFO scripts.py:975 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
INFO 08-07 10:15:57 config.py:715] Defaulting to use ray for distributed inference
2024-08-07 10:15:57,398	INFO worker.py:1567 -- Connecting to existing Ray cluster at address: lrdn1699:6379...
2024-08-07 10:15:57,406	INFO worker.py:1752 -- Connected to Ray cluster.
INFO 08-07 10:16:00 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf', speculative_config=None, tokenizer='/leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf', skip_tokenizer_init=False, tokenizer_mode=slow, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=32, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-07 10:27:36 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-07 10:27:36 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=700343, ip=10.5.1.82) INFO 08-07 10:27:36 utils.py:784] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=700343, ip=10.5.1.82) INFO 08-07 10:27:36 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 08-07 10:27:40 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=669372, ip=10.5.1.92) WARNING 08-07 10:27:40 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 08-07 10:27:40 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='10.5.1.3', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x14e7518b0f50>, local_subscribe_port=40815, local_sync_port=38345, remote_subscribe_port=57791, remote_sync_port=35819)
INFO 08-07 10:27:40 model_runner.py:680] Starting to load model /leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf...
(RayWorkerWrapper pid=700343, ip=10.5.1.82) INFO 08-07 10:27:40 model_runner.py:680] Starting to load model /leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf...

Loading safetensors checkpoint shards:   0% Completed | 0/191 [00:00<?, ?it/s]
[per-shard progress lines trimmed]
(RayWorkerWrapper pid=665758, ip=10.5.1.70) INFO 08-07 10:36:23 model_runner.py:692] Loading model weights took 24.4023 GB
(RayWorkerWrapper pid=665758, ip=10.5.1.70) INFO 08-07 10:27:36 utils.py:784] Found nccl from library libnccl.so.2 [repeated 30x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=665758, ip=10.5.1.70) INFO 08-07 10:27:36 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 30x across cluster]
(RayWorkerWrapper pid=700257, ip=10.5.1.64) WARNING 08-07 10:27:40 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes. [repeated 30x across cluster]
(RayWorkerWrapper pid=665758, ip=10.5.1.70) INFO 08-07 10:27:40 model_runner.py:680] Starting to load model /leonardo_scratch/large/userexternal/gpuccett/models/hf_llama/llama-3.1-405b-instruct-hf... [repeated 30x across cluster]
(RayWorkerWrapper pid=665325, ip=10.5.1.70) INFO 08-07 10:36:31 model_runner.py:692] Loading model weights took 24.4023 GB
(RayWorkerWrapper pid=669570, ip=10.5.1.92) INFO 08-07 10:37:44 model_runner.py:692] Loading model weights took 24.4023 GB
(RayWorkerWrapper pid=685897, ip=10.5.1.43) INFO 08-07 10:39:01 model_runner.py:692] Loading model weights took 24.4023 GB
(RayWorkerWrapper pid=665633, ip=10.5.1.70) INFO 08-07 10:39:49 model_runner.py:692] Loading model weights took 24.4023 GB
(RayWorkerWrapper pid=669372, ip=10.5.1.92) INFO 08-07 10:40:49 model_runner.py:692] Loading model weights took 24.4023 GB [repeated 2x across cluster]
Loading safetensors checkpoint shards:  73% Completed | 140/191 [13:29<01:31,  1.79s/it]

Loading safetensors checkpoint shards:  74% Completed | 141/191 [13:32<01:51,  2.23s/it]

Loading safetensors checkpoint shards:  74% Completed | 142/191 [13:33<01:22,  1.69s/it]

Loading safetensors checkpoint shards:  75% Completed | 143/191 [13:37<01:55,  2.42s/it]

Loading safetensors checkpoint shards:  75% Completed | 144/191 [13:40<02:01,  2.59s/it]

Loading safetensors checkpoint shards:  76% Completed | 145/191 [13:46<02:49,  3.69s/it]

Loading safetensors checkpoint shards:  76% Completed | 146/191 [13:46<01:59,  2.65s/it]

Loading safetensors checkpoint shards:  77% Completed | 147/191 [13:47<01:26,  1.97s/it]

Loading safetensors checkpoint shards:  77% Completed | 148/191 [13:50<01:46,  2.48s/it]

Loading safetensors checkpoint shards:  78% Completed | 149/191 [13:51<01:21,  1.95s/it]

Loading safetensors checkpoint shards:  79% Completed | 150/191 [13:51<01:01,  1.50s/it]

Loading safetensors checkpoint shards:  79% Completed | 151/191 [13:52<00:47,  1.18s/it]

Loading safetensors checkpoint shards:  80% Completed | 152/191 [13:52<00:35,  1.09it/s]

Loading safetensors checkpoint shards:  80% Completed | 153/191 [13:56<01:05,  1.73s/it]

Loading safetensors checkpoint shards:  81% Completed | 154/191 [14:01<01:47,  2.90s/it]

Loading safetensors checkpoint shards:  81% Completed | 155/191 [14:07<02:11,  3.65s/it]

Loading safetensors checkpoint shards:  82% Completed | 156/191 [14:07<01:33,  2.67s/it]

Loading safetensors checkpoint shards:  82% Completed | 157/191 [14:11<01:45,  3.11s/it]

Loading safetensors checkpoint shards:  83% Completed | 158/191 [14:12<01:18,  2.38s/it]

Loading safetensors checkpoint shards:  83% Completed | 159/191 [14:16<01:36,  3.00s/it]

Loading safetensors checkpoint shards:  84% Completed | 160/191 [14:22<01:56,  3.76s/it]

Loading safetensors checkpoint shards:  84% Completed | 161/191 [14:30<02:29,  4.98s/it]

Loading safetensors checkpoint shards:  85% Completed | 162/191 [14:35<02:23,  4.96s/it]

Loading safetensors checkpoint shards:  85% Completed | 163/191 [14:35<01:41,  3.63s/it]

Loading safetensors checkpoint shards:  86% Completed | 164/191 [14:39<01:35,  3.53s/it]

Loading safetensors checkpoint shards:  86% Completed | 165/191 [14:39<01:07,  2.61s/it]

Loading safetensors checkpoint shards:  87% Completed | 166/191 [14:42<01:10,  2.83s/it]

Loading safetensors checkpoint shards:  87% Completed | 167/191 [14:43<00:50,  2.12s/it]

Loading safetensors checkpoint shards:  88% Completed | 168/191 [14:48<01:12,  3.13s/it]

Loading safetensors checkpoint shards:  88% Completed | 169/191 [14:53<01:21,  3.72s/it]
�[36m(RayWorkerWrapper pid=200118)�[0m INFO 08-07 10:42:38 model_runner.py:692] Loading model weights took 24.4023 GB

Loading safetensors checkpoint shards:  89% Completed | 170/191 [14:58<01:21,  3.88s/it]

Loading safetensors checkpoint shards:  90% Completed | 171/191 [15:05<01:38,  4.91s/it]
�[36m(RayWorkerWrapper pid=669507, ip=10.5.1.92)�[0m INFO 08-07 10:42:51 model_runner.py:692] Loading model weights took 24.4023 GB

Loading safetensors checkpoint shards:  90% Completed | 172/191 [15:12<01:47,  5.66s/it]

Loading safetensors checkpoint shards:  91% Completed | 173/191 [15:18<01:43,  5.76s/it]

Loading safetensors checkpoint shards:  91% Completed | 174/191 [15:24<01:34,  5.58s/it]

Loading safetensors checkpoint shards:  92% Completed | 175/191 [15:24<01:04,  4.05s/it]

Loading safetensors checkpoint shards:  92% Completed | 176/191 [15:27<00:56,  3.78s/it]

Loading safetensors checkpoint shards:  93% Completed | 177/191 [15:32<00:57,  4.09s/it]

Loading safetensors checkpoint shards:  93% Completed | 178/191 [15:38<00:58,  4.53s/it]

Loading safetensors checkpoint shards:  94% Completed | 179/191 [15:40<00:47,  3.96s/it]

Loading safetensors checkpoint shards:  94% Completed | 180/191 [15:45<00:46,  4.20s/it]

Loading safetensors checkpoint shards:  95% Completed | 181/191 [15:49<00:40,  4.02s/it]

Loading safetensors checkpoint shards:  95% Completed | 182/191 [15:52<00:35,  3.94s/it]

Loading safetensors checkpoint shards:  96% Completed | 183/191 [15:57<00:33,  4.13s/it]

Loading safetensors checkpoint shards:  96% Completed | 184/191 [15:57<00:21,  3.04s/it]

Loading safetensors checkpoint shards:  97% Completed | 185/191 [16:03<00:23,  3.87s/it]

Loading safetensors checkpoint shards:  97% Completed | 186/191 [16:04<00:14,  2.84s/it]

Loading safetensors checkpoint shards:  98% Completed | 187/191 [16:16<00:22,  5.74s/it]

Loading safetensors checkpoint shards:  98% Completed | 188/191 [16:21<00:16,  5.45s/it]

Loading safetensors checkpoint shards:  99% Completed | 189/191 [16:26<00:10,  5.21s/it]

Loading safetensors checkpoint shards:  99% Completed | 190/191 [16:26<00:03,  3.81s/it]
�[36m(RayWorkerWrapper pid=200055)�[0m INFO 08-07 10:44:11 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 2x across cluster]�[0m

Loading safetensors checkpoint shards: 100% Completed | 191/191 [16:27<00:00,  2.81s/it]

Loading safetensors checkpoint shards: 100% Completed | 191/191 [16:27<00:00,  5.17s/it]

INFO 08-07 10:44:12 model_runner.py:692] Loading model weights took 24.4023 GB
�[36m(RayWorkerWrapper pid=685465, ip=10.5.1.43)�[0m INFO 08-07 10:44:21 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 2x across cluster]�[0m
�[36m(RayWorkerWrapper pid=685834, ip=10.5.1.43)�[0m INFO 08-07 10:45:24 model_runner.py:692] Loading model weights took 24.4023 GB
�[36m(RayWorkerWrapper pid=685772, ip=10.5.1.43)�[0m INFO 08-07 10:45:26 model_runner.py:692] Loading model weights took 24.4023 GB
�[36m(RayWorkerWrapper pid=728340, ip=10.6.0.55)�[0m INFO 08-07 10:46:02 model_runner.py:692] Loading model weights took 24.4023 GB
�[36m(RayWorkerWrapper pid=700478, ip=10.5.1.82)�[0m INFO 08-07 10:46:20 model_runner.py:692] Loading model weights took 24.4023 GB
�[36m(RayWorkerWrapper pid=700343, ip=10.5.1.82)�[0m INFO 08-07 10:48:55 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 2x across cluster]�[0m
�[36m(RayWorkerWrapper pid=215673, ip=10.6.0.13)�[0m INFO 08-07 10:49:14 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 4x across cluster]�[0m
�[36m(RayWorkerWrapper pid=728215, ip=10.6.0.55)�[0m INFO 08-07 10:49:37 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 2x across cluster]�[0m
�[36m(RayWorkerWrapper pid=700194, ip=10.5.1.64)�[0m INFO 08-07 10:50:20 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 3x across cluster]�[0m
�[36m(RayWorkerWrapper pid=699802, ip=10.5.1.64)�[0m INFO 08-07 10:50:27 model_runner.py:692] Loading model weights took 24.4023 GB�[32m [repeated 2x across cluster]�[0m
INFO 08-07 10:50:46 distributed_gpu_executor.py:56] # GPU blocks: 31360, # CPU blocks: 4161
(RayWorkerWrapper pid=700257, ip=10.5.1.64) INFO 08-07 10:50:49 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerWrapper pid=700257, ip=10.5.1.64) INFO 08-07 10:50:49 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=700132, ip=10.5.1.64) INFO 08-07 10:50:27 model_runner.py:692] Loading model weights took 24.4023 GB
INFO 08-07 10:50:49 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-07 10:50:49 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
slurmstepd: error: *** STEP 6730439.10 ON lrdn1788 CANCELLED AT 2024-08-07T11:37:59 ***
slurmstepd: error: *** STEP 6730439.14 ON lrdn1855 CANCELLED AT 2024-08-07T11:37:59 ***
slurmstepd: error: *** STEP 6730439.8 ON lrdn1778 CANCELLED AT 2024-08-07T11:37:59 ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 6730439.2 ON lrdn1739 CANCELLED AT 2024-08-07T11:38:02 ***
slurmstepd: error: *** STEP 6730439.4 ON lrdn1760 CANCELLED AT 2024-08-07T11:38:02 ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** JOB 6730439 ON lrdn1699 CANCELLED AT 2024-08-07T11:38:02 ***
slurmstepd: error: *** STEP 6730439.0 ON lrdn1699 CANCELLED AT 2024-08-07T11:38:02 ***

@youkaichao
Copy link
Member

@gpucce did you try adding enforce_eager=True? It seems multi-node TP has some issues; see #6783.
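For reference, the CLI equivalent of enforce_eager=True is the --enforce-eager flag. A minimal sketch (the model path and --tensor-parallel-size value below are placeholders, not the exact setup from the report above; for multi-node TP the nodes still need to be joined into a Ray cluster first):

vllm serve <path-to-llama-3.1-405b> --tensor-parallel-size 32 --enforce-eager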

@shreshtshettybs
Copy link

shreshtshettybs commented Aug 8, 2024

Hi, I am facing an issue with serving a fine-tuned version of Llama 3.1. The issue is not with the deployment itself, since the deployment works fine with this command:

sudo docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=abc" -p 8000:8000 --ipc=host docker.io/vllm/vllm-openai:latest --model shress/llama3.1-16bit --max_model_len=8000

But when I send requests to the endpoint I get very random responses. I looked at the server logs and noticed that the applied chat template doesn't apply the tokens correctly based on the roles. Do we need to manually create a chat template and apply it at deployment time? Or is there a Llama 3.1 chat template offered by vLLM that automatically applies the necessary tokens for each role?
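If the fine-tuned repo is missing the chat template in its tokenizer_config.json, vLLM can be pointed at an explicit Jinja template via the --chat-template flag. A minimal sketch (the template path is hypothetical and has to be mounted into the container):

sudo docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -v ~/templates:/templates --env "HUGGING_FACE_HUB_TOKEN=abc" -p 8000:8000 --ipc=host docker.io/vllm/vllm-openai:latest --model shress/llama3.1-16bit --max-model-len 8000 --chat-template /templates/llama31_chat.jinja

Otherwise vLLM falls back to the chat template shipped in the model's tokenizer_config.json, so checking that the fine-tune preserved the original Llama 3.1 template is usually the first thing to verify.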

@gpucce
Copy link

gpucce commented Aug 8, 2024

@gpucce did you try adding enforce_eager=True? It seems multi-node TP has some issues; see #6783.

@youkaichao Thanks!

Indeed, that fixes it. I was able to generate text on 8 nodes and 32 GPUs!
Any idea how much slower it gets?

Also, if you know: is pipeline parallelism going to be supported for the LLMEngine class at some point, or never?

@youkaichao
Copy link
Member

Any idea how much slower it gets?

enforce_eager might be a little slower, but not much, given the model size.

You should use pipeline parallelism with the API server directly, via vllm serve <model> -pp <p> -tp <t>.
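Spelled out with the long-form flags, that is (a sketch with hypothetical sizes for a 4-node, 8-GPU-per-node setup):

vllm serve <path-to-llama-3.1-405b> --pipeline-parallel-size 4 --tensor-parallel-size 8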

@gpucce
Copy link

gpucce commented Aug 8, 2024

Any idea how much slower it gets?

enforce_eager might be a little slower, but not much, given the model size.

You should use pipeline parallelism with the API server directly, via vllm serve <model> -pp <p> -tp <t>.

Unfortunately I can't easily do that, because the nodes where I can run this have no internet access or any other outside connection, so I would need to do some awkward workarounds to log in to the nodes myself, or something like that.

Thank you very much!

@simon-mo simon-mo unpinned this issue Aug 9, 2024
@ywang96
Copy link
Member

ywang96 commented Aug 16, 2024

For those who are having problems launching meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on 8xH100s: Hugging Face has fixed the model repository with the correct number of kv heads, so you should no longer have an issue launching the model server with default vLLM args and the updated model weights!

@nivibilla
Copy link

Llama 3.1 405B crashes when running 8 requests. I'm running with TP=64. Very frustrating, as it took 3 hours to load the model from S3.

INFO 08-22 17:52:17 model_runner.py:720] Starting to load model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8/...
(RayWorkerWrapper pid=364392, ip=10.168.80.7) INFO 08-22 17:52:17 model_runner.py:720] Starting to load model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8/...
Loading safetensors checkpoint shards:   0% Completed | 0/86 [00:00<?, ?it/s]
[... per-shard progress lines trimmed ...]
(RayWorkerWrapper pid=406657, ip=10.168.90.165) INFO 08-22 17:52:16 utils.py:841] Found nccl from library libnccl.so.2 [repeated 62x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=406657, ip=10.168.90.165) INFO 08-22 17:52:16 pynccl.py:63] vLLM is using nccl==2.22.3 [repeated 62x across cluster]
(RayWorkerWrapper pid=406657, ip=10.168.90.165) WARNING 08-22 17:52:17 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes. [repeated 62x across cluster]
(RayWorkerWrapper pid=408527, ip=10.168.88.194) INFO 08-22 17:52:17 model_runner.py:720] Starting to load model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8/... [repeated 62x across cluster]
(RayWorkerWrapper pid=382632, ip=10.168.85.146) INFO 08-22 18:37:41 model_runner.py:732] Loading model weights took 6.4997 GB
[... similar "Loading model weights took 6.4997 GB" lines from the remaining workers trimmed ...]
Loading safetensors checkpoint shards: 100% Completed | 86/86 [2:58:00<00:00, 124.19s/it]

(RayWorkerWrapper pid=886068) INFO 08-22 20:50:18 model_runner.py:732] Loading model weights took 6.4997 GB [repeated 8x across cluster]
INFO 08-22 20:50:18 model_runner.py:732] Loading model weights took 6.4997 GB
INFO 08-22 20:52:08 distributed_gpu_executor.py:56] # GPU blocks: 11593, # CPU blocks: 4161
WARNING 08-22 20:52:16 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-22 20:52:16 launcher.py:14] Available routes are:
INFO 08-22 20:52:16 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-22 20:52:16 launcher.py:22] Route: /health, Methods: GET
INFO 08-22 20:52:16 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-22 20:52:16 launcher.py:22] Route: /version, Methods: GET
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-22 20:52:16 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [883391]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO 08-22 20:52:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[... identical idle metrics lines, logged every 10 seconds until 20:58:06, trimmed ...]
INFO 08-22 20:58:13 logger.py:36] Received request chat-0ebfc35455b54c299145ac43d8e221c6: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_

*** WARNING: max output size exceeded, skipping output. ***

INFO:     10.168.82.253:36612 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a9c49-c7a7-4f4e-b457-7d7b662b9efc/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

@simon-mo simon-mo closed this as completed Sep 4, 2024
@timxzz
Copy link

timxzz commented Sep 6, 2024

For those who are having problems launching meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on 8xH100s: Hugging Face has fixed the model repository with the correct number of kv heads, so you should no longer have an issue launching the model server with default vLLM args and the updated model weights!

I am using vLLM version 0.6.0 and have pulled the latest meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 from Hugging Face, but I still encounter the following error when launching the server on 8xH100 with the default args:
vllm serve Models/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (67184). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
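The error message itself points at two knobs: capping the context length, or giving vLLM a larger share of GPU memory. A minimal sketch of either workaround (65536 and 0.95 are just example values):

vllm serve Models/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-model-len 65536
vllm serve Models/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --gpu-memory-utilization 0.95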

@shreyasp-07
Copy link

shreyasp-07 commented Sep 9, 2024

I am trying to load the LLaMA 3.1 70B model on two different servers. The configuration of the servers is as follows:

  • Server-1: 6 GPUs, each with 24GB of memory.
  • Server-2: 4 GPUs, each with 24GB of memory.

When I load the model on Server-1 using the following command:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --device cuda --tensor-parallel-size 4 --gpu-memory-utilization 0.5 --swap-space 10 --dtype bfloat16 --api-key <openai-api-key> --enforce-eager

4 GPUs on Server-1 are each allocated 11GB of memory.

Next, I attempted to distribute the model across Server-1 and Server-2 by connecting them via a Ray cluster. I used the following command:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --device cuda --tensor-parallel-size 4 --gpu-memory-utilization 0.5 --swap-space 10 --dtype bfloat16 --api-key <openai-api-key> --enforce-eager --pipeline-parallel-size 2

After this, I observed that:

  • On Server-1, 6 GPUs were each allocated 11GB.
  • On Server-2, each of the 4 GPUs was also allocated 11GB.

I expected that using more GPUs across both servers would reduce memory usage per GPU. However, the allocation remains at 11GB per GPU on both servers.

Can anyone clarify why this is happening, and why adding more GPUs doesn't seem to reduce the memory usage per GPU?

@tjohnson31415
Copy link
Contributor

Like @timxzz, I too was getting the error about there not being enough space for the KV cache when running meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an 8xH100 node, even though it worked fine on 8xA100. After some investigation, I found that most of the "extra" memory usage was coming from NCCL. By setting NCCL_NVLS_ENABLE=0, I was able to free up 3 GiB of memory for the KV cache, which let the full context length fit. This env var disables NVLink SHARP, which is only available on Hopper and newer (REF); disabling it did not seem to affect performance in my quick testing.

@timxzz Hopefully setting NCCL_NVLS_ENABLE=0 helps in your case as well!
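A minimal sketch of applying this when launching the server (the variable just needs to be set in the environment of the vLLM process; with the Docker image it can be passed via --env "NCCL_NVLS_ENABLE=0"):

NCCL_NVLS_ENABLE=0 vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8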

@UniverseFly
Copy link

Like @timxzz, I too was getting the error about there not being enough space for the KV cache when running meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an 8xH100 node, even though it worked fine on 8xA100. After some investigation, I found that most of the "extra" memory usage was coming from NCCL. By setting NCCL_NVLS_ENABLE=0, I was able to free up 3 GiB of memory for the KV cache, which let the full context length fit. This env var disables NVLink SHARP, which is only available on Hopper and newer (REF); disabling it did not seem to affect performance in my quick testing.

@timxzz Hopefully setting NCCL_NVLS_ENABLE=0 helps in your case as well!

Wow, hours spent and I finally found this solution!

@timxzz
Copy link

timxzz commented Nov 12, 2024

Like @timxzz, I too was getting the error about there not being enough space for the KV cache when running meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an 8xH100 node, even though it worked fine on 8xA100. After some investigation, I found that most of the "extra" memory usage was coming from NCCL. By setting NCCL_NVLS_ENABLE=0, I was able to free up 3 GiB of memory for the KV cache, which let the full context length fit. This env var disables NVLink SHARP, which is only available on Hopper and newer (REF); disabling it did not seem to affect performance in my quick testing.

@timxzz Hopefully setting NCCL_NVLS_ENABLE=0 helps in your case as well!

Amazing! This works! Thanks!
