#quantization

3 posts

ai guide Apr 1, 2026

llama.cpp — From Pure C++ to an LLM Inference Engine on Consumer Hardware

llama.cpp is the most widely used local LLM inference engine, implemented in pure C/C++. It supports CPU, Metal, CUDA, Vulkan, and other backends, and uses the GGUF file format to store quantized weights, which lets multi-billion-parameter models run on consumer hardware.

#llama-cpp #gguf #quantization #llm-inference #apple-silicon #metal #cuda #local-llm

ai guide Apr 1, 2026

TurboQuant+ — Two-Stage Quantization to Compress KV Cache to 2-bit, Running 100B Models on a MacBook

TurboQuant+ is an open-source implementation of a Google Research ICLR 2026 paper that uses PolarQuant + QJL two-stage quantization to compress the KV cache by 3.8-6.4x, enabling consumer hardware to run larger models with longer contexts.

#turboquant #kv-cache #quantization #llm-inference #llama-cpp #apple-silicon

ai guide Mar 31, 2026

Small Models That Run on Phones: Choices and Constraints in 2026

The main on-device LLMs in 2026 are Gemma 3n, Qwen 3.5 Small, Llama 3.2, Phi-4-mini, Ministral 3, and SmolLM3. Sub-3B quantized models can reach 30-50 tokens/sec on phones with 8 GB of RAM, but memory capacity, thermal throttling, and context window size remain hard constraints.

#on-device-ai #small-models #mobile #quantization #llama #gemma #phi #qwen #mistral #smollm #mobilellm