
#apple-silicon

2 posts
ai guide

llama.cpp — From Pure C++ to an LLM Inference Engine on Consumer Hardware

llama.cpp is the most widely used local LLM inference engine, implemented in pure C/C++. It supports CPU, Metal, CUDA, Vulkan, and other backends, and loads quantized model weights stored in the GGUF file format to run multi-billion-parameter models on consumer hardware.

ai guide

TurboQuant+ — Two-Stage Quantization to Compress KV Cache to 2-bit, Running 100B Models on a MacBook

TurboQuant+ is an open-source implementation of a Google Research ICLR 2026 paper. It combines PolarQuant and QJL in a two-stage quantization scheme that compresses the KV cache by 3.8x to 6.4x, letting consumer hardware run larger models with longer contexts.