GitHub - zxybazh/llm-perf-bench

LLM Performance Benchmarking

Performance

Model	GPU	MLC LLM (tok/sec)	Exllama (tok/sec)
Llama2-7B	RTX 3090 Ti	166.7	112.72
Llama2-13B	RTX 3090 Ti	99.2	69.31
Llama2-7B	RTX 4090	191.0	152.56
Llama2-13B	RTX 4090	108.8	93.88
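
To read the table as relative speedups, the MLC LLM throughput can be divided by the Exllama throughput for each row. This is a minimal sketch that uses only the numbers above; the `results` and `speedup` names are illustrative and not part of this repository.

```python
# Throughput numbers copied from the table above (tokens per second).
results = [
    ("Llama2-7B", "RTX 3090 Ti", 166.7, 112.72),
    ("Llama2-13B", "RTX 3090 Ti", 99.2, 69.31),
    ("Llama2-7B", "RTX 4090", 191.0, 152.56),
    ("Llama2-13B", "RTX 4090", 108.8, 93.88),
]


def speedup(mlc_tok_per_sec, exllama_tok_per_sec):
    """Return how many times faster MLC LLM decodes than Exllama."""
    return mlc_tok_per_sec / exllama_tok_per_sec


for model, gpu, mlc, exllama in results:
    print(f"{model} on {gpu}: {speedup(mlc, exllama):.2f}x")
```

On these numbers the ratio ranges from roughly 1.16x (Llama2-13B on RTX 4090) to 1.48x (Llama2-7B on RTX 3090 Ti).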

Commit:

MLC LLM commit, TVM commit;
Exllama commit.

Instructions

First of all, NVIDIA Docker is required: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#docker.

MLC LLM

Step 1. Build Docker image

docker build -t llm-perf-mlc:v0.1 -f Dockerfile.cu121.mlc .

Step 2. Quantize and run Llama2. Start the Docker container and log in to it using the commands below:

PORT=45678
MODELS=/PATH/TO/MODEL/ # Replace with the path to your HuggingFace models

docker run            \
  -d -P               \
  --gpus all          \
  -h llm-perf         \
  --name llm-perf     \
  -p $PORT:22         \
  -v $MODELS:/models  \
  llm-perf-mlc:v0.1

# Password is: llm_perf
# Replace USER@HOST with the container's SSH user and host address
ssh USER@HOST -p $PORT

# Inside the container, run the following commands:
micromamba activate python311

cd $MLC_HOME
# Replace the --model value with the path to your HuggingFace model
python build.py \
  --model /models/Llama-2-7b-chat-hf \
  --target cuda \
  --quantization q4f16_1 \
  --artifact-path "./dist" \
  --use-cache 0

The quantized and compiled model will be exported to ./dist/Llama-2-7b-chat-hf-q4f16_1.
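
Before moving on, it can help to confirm that the build produced the expected directory. The sketch below assumes the output name is always the model directory name joined to the quantization mode with a hyphen, which is inferred from the single example above; `artifact_dir` is a hypothetical helper, and the check assumes it is run from `$MLC_HOME`.

```python
from pathlib import Path


def artifact_dir(model_path, quantization, artifact_path="./dist"):
    """Build the expected output directory: <artifact_path>/<model name>-<quantization>."""
    # Assumption: the naming rule is inferred from the example in this README.
    model_name = Path(model_path).name
    return Path(artifact_path) / f"{model_name}-{quantization}"


out = artifact_dir("/models/Llama-2-7b-chat-hf", "q4f16_1")
print(out, "exists" if out.is_dir() else "missing")
```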

Step 3. Run the CLI tool to see the performance numbers:

$MLC_HOME/build/mlc_chat_cli \
  --model Llama-2-7b-chat-hf \
  --quantization q4f16_1
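
To record the numbers instead of reading them off the terminal, the CLI output can be captured and parsed. The output format is not shown in this README, so the sketch below assumes a line of the form `decode: <number> tok/s`. Both the sample string and the `parse_decode_speed` helper are hypothetical and should be adjusted to whatever the tool actually prints.

```python
import re

# Assumed format; adjust the pattern to what the CLI actually prints.
DECODE_PATTERN = re.compile(r"decode:\s*([0-9]+(?:\.[0-9]+)?)\s*tok/s")


def parse_decode_speed(text):
    """Return the first decode speed found in text as a float, or None."""
    match = DECODE_PATTERN.search(text)
    return float(match.group(1)) if match else None


sample = "prefill: 512.3 tok/s, decode: 166.7 tok/s"  # hypothetical sample line
print(parse_decode_speed(sample))
```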

Exllama

TBD

Llama.cpp

TBD

TODOs

Only decoding performance is currently benchmarked, since prefilling usually takes much less time with flash attention.
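
Decoding throughput can be read as the number of generated tokens divided by the time spent generating them, with prefill time left out. A minimal sketch of that arithmetic follows; the function name and the example timings are made up for illustration and are not measurements from this repository.

```python
def decode_tok_per_sec(num_generated_tokens, total_seconds, prefill_seconds):
    """Tokens per second over the decode phase only (prefill time excluded)."""
    decode_seconds = total_seconds - prefill_seconds
    if decode_seconds <= 0:
        raise ValueError("decode time must be positive")
    return num_generated_tokens / decode_seconds


# Made-up example: 256 tokens generated, 1.7 s total, 0.1 s of it prefill.
print(round(decode_tok_per_sec(256, 1.7, 0.1), 1))
```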

Currently, the MLC LLM numbers include a long system prompt, while the Exllama numbers use a fixed-length system prompt of 4 tokens, so this is not an exact apples-to-apples comparison. This should be fixed.

