Fast LLM speculative inference server for consumer hardware.
Updated Jul 3, 2026 - C++
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
DFlash & TurboQuant in llama.cpp, with up to 3x faster generation and 7.5x more KV cache in the same VRAM
Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container, tuned for long-context draft acceptance on DGX Spark. 6 HF variants (BF16/NVFP4/MTP/MTP-XS), docker-compose, and QuickStart.
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP
Local AI workstation — discover, run, chat, benchmark, and generate images from open-weight models. DFlash/DDTree speculative decoding, TurboQuant & TriAttention cache compression strategies, MLX + llama.cpp + vLLM + MTPLX backends.
Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.
Qwen3.6-27B BF16+DFlash 13-config parameter sweep on repne/vllm:v2. Stage A (3x3 buffer/graph) + Stage B (num_speculative_tokens) + Quality (HumanEval/MBPP). 7h, 421 problems, 195 cells.
GGUF-native DFlash speculative decoding runtime for local models
vLLM v0.21 + DFlash + thinking_token_budget for Gemma 4 & Qwen 3.6 on Blackwell GB10 (sm_121a / sm_120)
CLI for building and testing DFlash-style speculative decoding draft models.
Reproducible efficient-inference stack for Qwen3.5-4B (AdaptFM Efficient Qwen Competition): GPTQ W4A16 g128, untied W8 lm_head, DFlash speculative decoding, and per-step vLLM latency optimizations. 7.745× average latency speedup with all quality gates passing.
Cogni-Brain on DGX Spark: Qwen3.5-122B-A10B INT4+FP8 hybrid, DFlash speculative decoding, 262K context, ~54 tok/s, 100/100 Tool-Eval, vLLM.