Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.

Set of LLM REST APIs and a simple web front end to interact with llama.cpp.

**Features:**

 * LLM inference of F16 and quantized models on GPU and CPU
 * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
 * Parallel decoding with multi-user support
 * Continuous batching
 * Multimodal (wip)
 * Monitoring endpoints
 * Schema-constrained JSON response format
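
For a quick sense of the OpenAI-compatible route, here is a minimal client sketch using the same httplib and nlohmann::json libraries the server itself builds on. It assumes a server instance is already running on `localhost:8080` (the default port); the `model` field is a placeholder, since the server answers with whichever model it was launched with.

```cpp
// Minimal sketch: POST a chat completion request to a running server.
// Assumes the server is listening on the default localhost:8080.
#include <iostream>
#include <httplib.h>
#include <nlohmann/json.hpp>

int main() {
    httplib::Client cli("localhost", 8080);

    // Standard OpenAI-style chat payload; "model" is a placeholder here.
    nlohmann::json req = {
        {"model", "any"},
        {"messages", {
            {{"role", "system"}, {"content", "You are a helpful assistant."}},
            {{"role", "user"},   {"content", "Hello!"}}
        }}
    };

    auto res = cli.Post("/v1/chat/completions", req.dump(), "application/json");
    if (!res || res->status != 200) {
        std::cerr << "request failed\n";
        return 1;
    }

    // Extract the assistant's reply from the first choice.
    auto body = nlohmann::json::parse(res->body);
    std::cout << body["choices"][0]["message"]["content"] << "\n";
    return 0;
}
```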