FlashInfer: Kernel Library for LLM Serving (Python, updated Mar 25, 2026)
A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
A distributed LLM inference program based on llama.cpp that lets multiple computers on a local network collaborate on large language model inference, with a cross-platform desktop UI built with Electron.
Code for paper "JMDC: A Joint Model and Data Compression System for Deep Neural Networks Collaborative Computing in Edge-Cloud Networks"
Analyze and generate unstructured data using LLMs, from quick experiments to billion token jobs.
Accelerate reproducible inference experiments for large language models with LLM-D! This lab automates the setup of a complete evaluation environment on OpenShift/OKD: GPU worker pools, core operators, observability, traffic control, and ready-to-run example workloads.
Source code of the paper "Private Collaborative Edge Inference via Over-the-Air Computation".
Super Ollama Load Balancer - Performance-aware routing for distributed Ollama deployments with Ray, Dask, and adaptive metrics
Pool your CUDA + ROCm GPUs into one OpenAI-compatible API. Speculative decoding proxy gives you 2-3x faster inference — for free, using hardware you already own. Stop renting GPU clouds. Be a tightwad.
Web UI for orchestrating distributed llama.cpp RPC GPU clusters with auto node discovery, telemetry, and one-click deployment.
Official impl. of ACM MM paper "Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds". A distributed inference model for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system, jointly training pedestrian attribute recognition and re-ID.
Turn any Kubernetes cluster into a private LLM endpoint. One Helm command deploys distributed inference across commodity hardware: Raspberry Pis, old servers, mixed architectures. OpenAI-compatible API powered by llama.cpp RPC.
A comprehensive framework for multi-node, multi-GPU scalable LLM inference on HPC systems using vLLM and Ollama. Includes distributed deployment templates, benchmarking workflows, and chatbot/RAG pipelines for high-throughput, production-grade AI services
Encrypted Decentralized Inference and Learning (E.D.I.L.)
A cache-centric architecture, compatibility contracts, and protocols for KV cache handoff in LLM inference.
Distributed LLM inference across multiple machines. A central server routes OpenAI-compatible requests to llama.cpp client nodes, with automatic model distribution and mutual TLS security.
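The routing pattern in the entry above (a central server forwarding OpenAI-compatible requests to llama.cpp client nodes) can be sketched minimally as follows. The node URLs, the `InferenceRouter` class, and the round-robin policy are illustrative assumptions for this sketch, not that repository's actual code, which also handles model distribution and mutual TLS.

```python
# Minimal sketch of a central router that forwards OpenAI-compatible
# chat requests to llama.cpp worker nodes (round-robin is an assumed
# policy; real routers may weigh load, model placement, or health).
from itertools import cycle


class InferenceRouter:
    """Round-robin router over nodes exposing /v1/chat/completions."""

    def __init__(self, node_urls: list[str]) -> None:
        self._nodes = cycle(node_urls)

    def route(self, request: dict) -> tuple[str, dict]:
        # Pick the next node; the OpenAI-style payload is forwarded
        # unchanged so any compatible client keeps working.
        node = next(self._nodes)
        return f"{node}/v1/chat/completions", request


router = InferenceRouter(["http://10.0.0.2:8080", "http://10.0.0.3:8080"])
url, payload = router.route({"model": "llama-3", "messages": []})
```

A real deployment would send `payload` to `url` over HTTPS with client certificates (the mutual TLS the entry mentions); the sketch only shows the dispatch decision.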
🚀 Master GPU kernel programming and optimization for high-performance AI systems with this comprehensive learning guide and resource hub.
Distributed inference across heterogeneous hardware.
Practical guide to clustering NVIDIA DGX Spark nodes for multi-node vLLM inference (NCCL, RoCE, Ray), with troubleshooting playbooks and step-by-step notebooks.