A high-performance optimization tool for vLLM MoE (Mixture of Experts) model kernel tuning
A specialized toolkit for optimizing MoE model inference performance in the vLLM framework. It automatically tunes Triton kernel parameters to find the optimal execution configuration for different model architectures and hardware setups.
- 🔧 Automated Kernel Tuning: Uses the Ray distributed framework to automatically search for optimal Triton kernel configurations
- 📊 Multi-Model Support: Supports mainstream MoE models including Mixtral, DeepSeek, Qwen, and Jamba
- ⚡ Performance Optimization: Specialized optimization for different batch sizes and hardware configurations
- 🛠️ Fault Diagnosis: Complete environment checking and troubleshooting tools
- 📈 Result Analysis: Generates detailed performance reports and configuration recommendations
For users experiencing vLLM version compatibility issues, we provide benchmark_moe_fixed.py - a fully compatible version that resolves common API incompatibilities across different vLLM versions.
Key Improvements:
- ✅ Multi-level import fallback for the `_get_config_dtype_str` function (see the sketch after this list)
- ✅ Dynamic parameter compatibility for `FusedMoEQuantConfig.make()`
- ✅ Automatic function signature detection for `fused_experts()`
- ✅ Clean English output (removes emoji and Chinese text)
- ✅ Production-ready logging and error handling
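These shims follow a common pattern: try the newest import path first, fall back to older ones, and inspect a function's signature before passing version-specific keyword arguments. Below is a minimal sketch of that pattern; the module paths and the helper `call_with_supported_kwargs` are illustrative assumptions, not the actual implementation in benchmark_moe_fixed.py, since the location of `_get_config_dtype_str` differs between vLLM releases.

```python
import inspect

# Multi-level import fallback: try newer module paths first, then older ones.
# The module paths below are illustrative; real locations vary across vLLM versions.
try:
    from vllm.model_executor.layers.fused_moe.config import _get_config_dtype_str
except ImportError:
    try:
        from vllm.model_executor.layers.fused_moe.fused_moe import _get_config_dtype_str
    except ImportError:
        _get_config_dtype_str = None  # caller falls back to its own default


def call_with_supported_kwargs(func, *args, **kwargs):
    """Drop keyword arguments that the installed function does not accept.

    This avoids errors such as
    "fused_experts() got an unexpected keyword argument 'quant_config'"
    when running against older or newer vLLM versions.
    """
    accepted = inspect.signature(func).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return func(*args, **filtered)
```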
Usage:
# Use the fixed version instead of benchmark_moe.py
python benchmark_moe_fixed.py \
--model /path/to/your/model \
--tp-size 1 \
--dtype auto \
--batch-size 1 2 4 8 \
--tune \
--save-dir ./optimized_configs \
--trust-remote-code

When to use benchmark_moe_fixed.py:
- Encountering `ImportError: cannot import name '_get_config_dtype_str'`
- Getting `TypeError: FusedMoEQuantConfig.make() got an unexpected keyword argument`
- Facing `TypeError: fused_experts() got an unexpected keyword argument 'quant_config'`
- Running different vLLM versions (0.6.0 - 0.10.0+); a quick version check is shown after this list
- Need clean English output for production environments
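If you are unsure which vLLM release is installed, a quick check before choosing between the two scripts:

```python
# Print the installed vLLM version; 0.6.0 - 0.10.0+ are covered by benchmark_moe_fixed.py.
import vllm

print(vllm.__version__)
```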
- Hardware: NVIDIA GPU (recommended A100/H100)
- Software: Ubuntu 18.04+, Python 3.11+, CUDA 11.8+
- Dependencies: vLLM 0.6.0+, PyTorch 2.0+, Ray
- Clone the project

  git clone https://github.com/massif-01/benchmark_moe.git
  cd benchmark_moe

- Environment check

  bash scripts/server_check.sh

- Run single model tuning

  # Basic tuning - Qwen3 model
  python benchmark_moe.py \
    --model /path/to/your/qwen3-model \
    --tp-size 1 \
    --dtype auto \
    --batch-size 1 2 4 8 16 32 64 128 \
    --tune \
    --save-dir ./optimized_configs \
    --trust-remote-code

- View results

  ls ./optimized_configs/
  # Output: E64N9472_tp1_fp16.json (example config file)
# Run environment check script
bash scripts/server_check.sh
# Check GPU status
nvidia-smi
# Check Python dependencies
python -c "import vllm, ray, torch, triton; print('Environment check passed')"

Issue 1: Triton Cache Corruption
# Clear Triton cache (if encountering JSONDecodeError)
rm -rf ~/.triton/cache/*
# Or set new cache directory
export TRITON_CACHE_DIR=/tmp/triton_cache_$(date +%s)
mkdir -p $TRITON_CACHE_DIR

Issue 2: libstdc++ Version Issues
# Update libstdc++ in conda environment
conda install -c conda-forge libstdcxx-ng
# Or use system library
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

Issue 3: Ray Warnings
# Suppress Ray-related warnings
export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
export RAY_DISABLE_IMPORT_WARNING=1

- `--model`: Model path or HuggingFace model name
- `--tp-size`: Tensor parallelism degree (set according to GPU count)
- `--dtype`: Data type (`auto`, `fp8_w8a8`, `int8_w8a16`)
- `--batch-size`: List of batch sizes to test
- `--tune`: Enable tuning mode
- `--save-dir`: Configuration file save directory
- `--use-deep-gemm`: Enable DeepGEMM optimization
- `--enable-expert-parallel`: Enable expert parallelism (for large models)
- `--seed`: Random seed (ensures reproducible results)
| Model Series | Experts | Top-K | Recommended VRAM | Example Command |
|---|---|---|---|---|
| Qwen3-30B-A3B | 64 | 4 | 64GB+ | --model path/to/qwen3 --tp-size 1 |
| Mixtral-8x7B | 8 | 2 | 45GB+ | --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tp-size 2 |
| DeepSeek-V2 | 160 | 6 | 80GB+ | --model deepseek-ai/DeepSeek-V2-Chat --tp-size 4 |
| DeepSeek-V3 | 256 | 8 | 120GB+ | --model deepseek-ai/DeepSeek-V3-Base --tp-size 8 |
# List supported models
python tools/config_manager.py list
# Tune specific model
python tools/config_manager.py tune qwen3_30b
# View configuration recommendations
python tools/config_manager.py recommend qwen3_30b

# Use safe script to test batch sizes one by one
bash scripts/run_benchmark_safe.sh

Example of a generated configuration file (one entry per batch size; a loading sketch follows the example):

{
"triton_version": "2.1.0",
"1": { // Optimal config for batch_size=1
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"64": { // Optimal config for batch_size=64
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 32,
"num_warps": 8,
"num_stages": 4
}
}
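To inspect which kernel parameters were selected for each batch size, the saved file can be read back with the standard library. This is a minimal sketch assuming the example file name from the quick start above and a plain-JSON file (the `//` annotations in the example are for illustration only):

```python
import json

# Hypothetical path built from the example output above; adjust to your own run.
config_path = "./optimized_configs/E64N9472_tp1_fp16.json"

with open(config_path) as f:
    configs = json.load(f)

# Numeric keys are batch sizes; their values are the tuned Triton kernel parameters.
for batch_size in sorted((k for k in configs if k.isdigit()), key=int):
    print(f"batch_size={batch_size}: {configs[batch_size]}")
```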
# Run performance benchmark (without --tune)
python benchmark_moe.py \
--model your_model \
--tp-size 1 \
--batch-size 1 2 4 8 16 32 64 128
# Example output:
# Batch size: 1, Kernel time: 45.23 us
# Batch size: 64, Kernel time: 892.15 us

# Symptom: CUDA out of memory
# Solutions:
# - Reduce batch size
--batch-size 1 2 4 8 16 32
# - Use quantization
--dtype fp8_w8a8
# - Increase tensor parallelism (if multi-GPU)
--tp-size 2

# Symptom: version `GLIBCXX_3.4.30' not found
# Quick fix:
bash scripts/fix_libstdcxx.sh
# Or manual solutions:
conda install -c conda-forge libstdcxx-ng=12
# OR
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# Symptom: JSONDecodeError, OutOfResources
# Solutions:
rm -rf ~/.triton/cache/*
export TRITON_CACHE_DIR=/tmp/triton_cache_new

# Symptom: Model not found, Permission denied
# Solutions:
# - Check model path
ls /path/to/your/model
# - Add access permissions
--trust-remote-code
# - Pre-download model
huggingface-cli download model_name --local-dir ./models/

# Symptom: Ray cannot start
# Solution:
export RAY_DISABLE_IMPORT_WARNING=1
ray stop # Stop existing instance
ray start --head  # Restart

Low Latency Scenarios (Online Inference)
# Optimize small batch performance
--batch-size 1 2 4 8 16
--dtype fp8_w8a8  # Reduce memory access latency

High Throughput Scenarios (Batch Processing)
# Optimize large batch performance
--batch-size 64 128 256 512 1024
--dtype auto  # Balance precision and performance

Memory-Constrained Scenarios
# Maximize memory utilization
--dtype fp8_w8a8
--use-deep-gemm
--enable-expert-parallel

benchmark_moe/
├── README.md # English documentation
├── README_zh.md # Chinese documentation
├── LICENSE # Apache-2.0 license
├── .gitignore # Git ignore rules
├── benchmark_moe.py # vLLM MoE benchmark core script
├── scripts/ # Script directory
│ ├── server_check.sh # Server environment check script
│ ├── tune_mixtral.sh # Mixtral model tuning script
│ ├── tune_deepseek.sh # DeepSeek model tuning script
│ └── run_benchmark_safe.sh # Safe batch tuning script
├── configs/ # Configuration directory
│ ├── models.json # Supported model configurations
│ └── tuning_params.json # Tuning parameter configurations
├── tools/ # Analysis tools directory
│ └── config_manager.py # Configuration management tool
└── deployment/ # Deployment related files
├── requirements.txt # Python dependencies list
└── DEPLOYMENT_GUIDE.md # Detailed deployment guide
Contributions are welcome! Please feel free to submit Issues and Pull Requests.
git clone https://github.com/massif-01/benchmark_moe.git
cd benchmark_moe
pip install -r deployment/requirements.txt

- Fork this project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
- vLLM - High-performance LLM inference engine
- Ray - Distributed computing framework
- Triton - GPU programming language
For questions or suggestions, please contact us via:
- Submit GitHub Issue
- Start a Discussion
⭐ If this project helps you, please give us a Star!