Model | GPU | MLC LLM (tok/sec) | Exllama (tok/sec) |
---|---|---|---|
Llama2-7B | RTX 3090 Ti | 154.1 | 116.38 |
Llama2-13B | RTX 3090 Ti | 93.1 | 70.45 |
Commit:
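For intuition, the decode rates in the table above translate directly into generation latency. The sketch below is only a back-of-the-envelope conversion; the 256-token output length is an arbitrary illustrative choice, not part of the benchmark.

```bash
# Seconds needed to generate 256 tokens at each measured decode rate (tok/sec).
# The rates are the four numbers from the table above.
for rate in 154.1 116.38 93.1 70.45; do
  awk -v r="$rate" 'BEGIN { printf "%6.2f tok/s -> %.2f s for 256 tokens\n", r, 256 / r }'
done
```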
First of all, NVIDIA Docker is required: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#docker.
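After installing the toolkit, it is worth verifying that containers can actually see the GPU. The image tag below is just an example; any CUDA-enabled image works:

```bash
# Sanity check: the host GPUs should be listed from inside a throwaway container.
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```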
Then, build the Docker image:
docker build -t mlc-perf:v0.1 .
First, log in to the docker container we created using the command below:
PORT=45678
MODELS=$HOME/models/
docker run \
-d -P \
--gpus all \
-h mlc-perf \
--name mlc-perf \
-p $PORT:22 \
-v $MODELS:/models \
mlc-perf:v0.1
ssh root@0.0.0.0 -p $PORT # password: mlc_llm_perf
Note: Allowing direct root login raises security concerns; we do it here only to keep this quick demo simple.
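Before moving on, it is also worth confirming from the host that the GPU and the mounted model directory are visible inside the container:

```bash
# Both commands should succeed if --gpus all and the -v mount took effect.
docker exec mlc-perf nvidia-smi
docker exec mlc-perf ls /models
```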
Then, compile the Llama2 model using MLC inside the docker container:
micromamba activate python311
cd $MLC_HOME
python build.py \
--model /models/Llama-2/hf/Llama-2-7b-chat-hf \
--target cuda \
--quantization q4f16_1 \
--artifact-path "./dist" \
--use-cache 0
The quantized and compiled model will be exported to ./dist/Llama-2-7b-chat-hf-q4f16_1.
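To reproduce the Llama2-13B row of the table, the same command can be pointed at the 13B checkpoint. The path below is an assumption that the 13B weights sit next to the 7B ones under /models; adjust it to your layout:

```bash
# Same flags as above; only the model path changes (path is an assumption).
python build.py \
    --model /models/Llama-2/hf/Llama-2-13b-chat-hf \
    --target cuda \
    --quantization q4f16_1 \
    --artifact-path "./dist" \
    --use-cache 0
```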
Finally, run the model and see the performance numbers:
$MLC_HOME/build/mlc_chat_cli \
--model Llama-2-7b-chat-hf \
--quantization q4f16_1
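The CLI starts an interactive chat session. In the mlc_chat_cli builds from this period, the runtime statistics (prefill and decode tok/sec for the last request) can be printed by typing the command below after a reply has finished; the exact command set may vary by version:

```
/stats
```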
Only decoding performance is currently benchmarked, since prefilling usually takes much less time with flash attention.
Currently, the MLC LLM numbers include a long system prompt, while the Exllama numbers are measured with a fixed-length system prompt of 4 tokens, so this is not exactly an apples-to-apples comparison. This should be fixed.