A high-performance optimization tool for vLLM MoE (Mixture of Experts) model kernel tuning
A specialized toolkit for optimizing MoE model inference performance in the vLLM framework. It automatically tunes Triton kernel parameters to find the optimal execution configuration for different model architectures and hardware setups.
- 🔧 Automated Kernel Tuning: Uses the Ray distributed framework to automatically search for optimal Triton kernel configurations
- 📊 Multi-Model Support: Supports mainstream MoE models including Mixtral, DeepSeek, Qwen, and Jamba
- ⚡ Performance Optimization: Specialized optimization for different batch sizes and hardware configurations
- 🛠️ Fault Diagnosis: Complete environment checking and troubleshooting tools
- 📈 Result Analysis: Generates detailed performance reports and configuration recommendations
For users experiencing vLLM version compatibility issues, we provide benchmark_moe_fixed.py - a fully compatible version that resolves common API incompatibilities across different vLLM versions.
Key Improvements:
- ✅ Multi-level import fallback for the `_get_config_dtype_str` function (see the sketch after this list)
- ✅ Dynamic parameter compatibility for `FusedMoEQuantConfig.make()`
- ✅ Automatic function signature detection for `fused_experts()`
- ✅ Clean English output (removes emoji and Chinese text)
- ✅ Production-ready logging and error handling
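These shims follow a common pattern: try the newest import path first, fall back to older ones, and inspect a function's signature before passing version-specific keyword arguments. Below is a minimal sketch of that pattern; the module paths and the helper `call_with_supported_kwargs` are illustrative assumptions, not the actual implementation in benchmark_moe_fixed.py, since the location of `_get_config_dtype_str` differs between vLLM releases.

```python
import inspect

# Multi-level import fallback: try newer module paths first, then older ones.
# The module paths below are illustrative; real locations vary across vLLM versions.
try:
    from vllm.model_executor.layers.fused_moe.config import _get_config_dtype_str
except ImportError:
    try:
        from vllm.model_executor.layers.fused_moe.fused_moe import _get_config_dtype_str
    except ImportError:
        _get_config_dtype_str = None  # caller falls back to its own default


def call_with_supported_kwargs(func, *args, **kwargs):
    """Drop keyword arguments that the installed function does not accept.

    This avoids errors such as
    "fused_experts() got an unexpected keyword argument 'quant_config'"
    when running against older or newer vLLM versions.
    """
    accepted = inspect.signature(func).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return func(*args, **filtered)
```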
Usage:
# Use the fixed version instead of benchmark_moe.py
python benchmark_moe_fixed.py \
--model /path/to/your/model \
--tp-size 1 \
--dtype auto \
--batch-size 1 2 4 8 \
--tune \
--save-dir ./optimized_configs \
--trust-remote-code

When to use benchmark_moe_fixed.py:
- Encountering `ImportError: cannot import name '_get_config_dtype_str'`
- Getting `TypeError: FusedMoEQuantConfig.make() got an unexpected keyword argument`
- Facing `TypeError: fused_experts() got an unexpected keyword argument 'quant_config'`
- Running different vLLM versions (0.6.0 - 0.10.0+); a quick version check is shown after this list
- Need clean English output for production environments
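If you are unsure which vLLM release is installed, a quick check before choosing between the two scripts:

```python
# Print the installed vLLM version; 0.6.0 - 0.10.0+ are covered by benchmark_moe_fixed.py.
import vllm

print(vllm.__version__)
```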
- Hardware: NVIDIA GPU (recommended A100/H100)
- Software: Ubuntu 18.04+, Python 3.11+, CUDA 11.8+
- Dependencies: vLLM 0.6.0+, PyTorch 2.0+, Ray
- Clone the project

  git clone https://github.com/massif-01/benchmark_moe.git
  cd benchmark_moe

- Environment check

  bash scripts/server_check.sh

- Run single model tuning

  # Basic tuning - Qwen3 model
  python benchmark_moe.py \
    --model /path/to/your/qwen3-model \
    --tp-size 1 \
    --dtype auto \
    --batch-size 1 2 4 8 16 32 64 128 \
    --tune \
    --save-dir ./optimized_configs \
    --trust-remote-code

- View results

  ls ./optimized_configs/
  # Output: E64N9472_tp1_fp16.json (example config file)
# Run environment check script
bash scripts/server_check.sh
# Check GPU status
nvidia-smi
# Check Python dependencies
python -c "import vllm, ray, torch, triton; print('Environment check passed')"

Issue 1: Triton Cache Corruption
# Clear Triton cache (if encountering JSONDecodeError)
rm -rf ~/.triton/cache/*
# Or set new cache directory
export TRITON_CACHE_DIR=/tmp/triton_cache_$(date +%s)
mkdir -p $TRITON_CACHE_DIR

Issue 2: libstdc++ Version Issues
# Update libstdc++ in conda environment
conda install -c conda-forge libstdcxx-ng
# Or use system library
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

Issue 3: Ray Warnings
# Suppress Ray-related warnings
export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
export RAY_DISABLE_IMPORT_WARNING=1

- `--model`: Model path or HuggingFace model name
- `--tp-size`: Tensor parallelism degree (set according to GPU count)
- `--dtype`: Data type (`auto`, `fp8_w8a8`, `int8_w8a16`)
- `--batch-size`: List of batch sizes to test
- `--tune`: Enable tuning mode
- `--save-dir`: Configuration file save directory
- `--use-deep-gemm`: Enable DeepGEMM optimization
- `--enable-expert-parallel`: Enable expert parallelism (for large models)
- `--seed`: Random seed (ensures reproducible results)
| Model Series | Experts | Top-K | Recommended VRAM | Example Command |
|---|---|---|---|---|
| Qwen3-30B-A3B | 64 | 4 | 64GB+ | --model path/to/qwen3 --tp-size 1 |
| Mixtral-8x7B | 8 | 2 | 45GB+ | --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tp-size 2 |
| DeepSeek-V2 | 160 | 6 | 80GB+ | --model deepseek-ai/DeepSeek-V2-Chat --tp-size 4 |
| DeepSeek-V3 | 256 | 8 | 120GB+ | --model deepseek-ai/DeepSeek-V3-Base --tp-size 8 |
# List supported models
python tools/config_manager.py list
# Tune specific model
python tools/config_manager.py tune qwen3_30b
# View configuration recommendations
python tools/config_manager.py recommend qwen3_30b

# Use safe script to test batch sizes one by one
bash scripts/run_benchmark_safe.sh

Example of a generated configuration file (one entry per batch size; a loading sketch follows the example):

{
"triton_version": "2.1.0",
"1": { // Optimal config for batch_size=1
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"64": { // Optimal config for batch_size=64
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 32,
"num_warps": 8,
"num_stages": 4
}
}
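To inspect which kernel parameters were selected for each batch size, the saved file can be read back with the standard library. This is a minimal sketch assuming the example file name from the quick start above and a plain-JSON file (the `//` annotations in the example are for illustration only):

```python
import json

# Hypothetical path built from the example output above; adjust to your own run.
config_path = "./optimized_configs/E64N9472_tp1_fp16.json"

with open(config_path) as f:
    configs = json.load(f)

# Numeric keys are batch sizes; their values are the tuned Triton kernel parameters.
for batch_size in sorted((k for k in configs if k.isdigit()), key=int):
    print(f"batch_size={batch_size}: {configs[batch_size]}")
```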
# Run performance benchmark (without --tune)
python benchmark_moe.py \
--model your_model \
--tp-size 1 \
--batch-size 1 2 4 8 16 32 64 128
# Example output:
# Batch size: 1, Kernel time: 45.23 us
# Batch size: 64, Kernel time: 892.15 us

# Symptom: CUDA out of memory
# Solutions:
# - Reduce batch size
--batch-size 1 2 4 8 16 32
# - Use quantization
--dtype fp8_w8a8
# - Increase tensor parallelism (if multi-GPU)
--tp-size 2

# Symptom: version `GLIBCXX_3.4.30' not found
# Quick fix:
bash scripts/fix_libstdcxx.sh
# Or manual solutions:
conda install -c conda-forge libstdcxx-ng=12
# OR
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# Symptom: JSONDecodeError, OutOfResources
# Solutions:
rm -rf ~/.triton/cache/*
export TRITON_CACHE_DIR=/tmp/triton_cache_new

# Symptom: Model not found, Permission denied
# Solutions:
# - Check model path
ls /path/to/your/model
# - Add access permissions
--trust-remote-code
# - Pre-download model
huggingface-cli download model_name --local-dir ./models/

# Symptom: Ray cannot start
# Solution:
export RAY_DISABLE_IMPORT_WARNING=1
ray stop # Stop existing instance
ray start --head  # Restart

Low Latency Scenarios (Online Inference)
# Optimize small batch performance
--batch-size 1 2 4 8 16
--dtype fp8_w8a8  # Reduce memory access latency

High Throughput Scenarios (Batch Processing)
# Optimize large batch performance
--batch-size 64 128 256 512 1024
--dtype auto  # Balance precision and performance

Memory-Constrained Scenarios
# Maximize memory utilization
--dtype fp8_w8a8
--use-deep-gemm
--enable-expert-parallel

benchmark_moe/
├── README.md # English documentation
├── README_zh.md # Chinese documentation
├── LICENSE # Apache-2.0 license
├── .gitignore # Git ignore rules
├── benchmark_moe.py # vLLM MoE benchmark core script
├── scripts/ # Script directory
│ ├── server_check.sh # Server environment check script
│ ├── tune_mixtral.sh # Mixtral model tuning script
│ ├── tune_deepseek.sh # DeepSeek model tuning script
│ └── run_benchmark_safe.sh # Safe batch tuning script
├── configs/ # Configuration directory
│ ├── models.json # Supported model configurations
│ └── tuning_params.json # Tuning parameter configurations
├── tools/ # Analysis tools directory
│ └── config_manager.py # Configuration management tool
└── deployment/ # Deployment related files
├── requirements.txt # Python dependencies list
└── DEPLOYMENT_GUIDE.md # Detailed deployment guide
Contributions are welcome! Please feel free to submit Issues and Pull Requests.
git clone https://github.com/massif-01/benchmark_moe.git
cd benchmark_moe
pip install -r deployment/requirements.txt

- Fork this project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
- vLLM - High-performance LLM inference engine
- Ray - Distributed computing framework
- Triton - GPU programming language
For questions or suggestions, please contact us via:
- Submit GitHub Issue
- Start a Discussion
⭐ If this project helps you, please give us a Star!