llama.cpp is the most widely used local LLM inference engine, implemented in pure C/C++. It supports CPU, Metal, CUDA, Vulkan, and other backends, and uses the GGUF file format to store quantized weights, which lets it run multi-billion-parameter models on consumer hardware.
TurboQuant+ is an open-source implementation of a Google Research paper (ICLR 2026). It applies two-stage quantization, combining PolarQuant and QJL, to compress the KV cache by 3.8x to 6.4x, so that consumer hardware can run larger models with longer contexts.
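To put the compression ratios in context, the following is a rough back-of-the-envelope sketch of KV cache size. The model shape (32 layers, 8 KV heads, head dimension 128, 32,768 tokens, FP16 values) is a hypothetical example chosen for round numbers, not a figure from the paper or the project, and the helper name kv_cache_bytes is made up for illustration.

```python
GIB = 2**30


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape (an assumption for illustration only).
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32768)
print(f"FP16 baseline: {baseline / GIB:.2f} GiB")

# Apply the compression range quoted above to the baseline.
for ratio in (3.8, 6.4):
    print(f"{ratio}x compression: {baseline / ratio / GIB:.3f} GiB")
```

Under these assumed dimensions the uncompressed cache is 4 GiB, and the quoted range would bring it down to roughly 1.05 GiB at 3.8x or 0.625 GiB at 6.4x. Real savings depend on the model and on any per-block metadata the quantizer stores.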
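vLLM uses PagedAttention, which stores the KV cache in fixed-size, non-contiguous blocks so that almost no memory is wasted on fragmentation or over-reservation. Combined with continuous batching and prefix caching, this has made it one of the most widely adopted open-source LLM inference engines today.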
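The paging idea can be illustrated with a toy block allocator. This is a minimal sketch of the general technique, not vLLM's implementation: the class PagedKVAllocator, its methods, and the block sizes are all hypothetical, and it tracks only block bookkeeping, not the actual key/value tensors.

```python
class PagedKVAllocator:
    """Toy allocator mapping each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.get(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Allocate a new block only when every existing block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
            self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused slots; at most block_size - 1 per sequence."""
        return sum(
            len(table) * self.block_size - self.seq_lens[seq_id]
            for seq_id, table in self.block_tables.items()
        )


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
print(alloc.block_tables, alloc.wasted_slots())
```

Because blocks are allocated on demand, waste is confined to the partially filled last block of each sequence, instead of a contiguous region reserved up front for the maximum context length.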