llama.cpp — From Pure C++ to an LLM Inference Engine on Consumer Hardware

Apr 1, 2026 1 min
TL;DR llama.cpp is the most widely used local LLM inference engine, implemented in pure C/C++. It supports CPU, Metal, CUDA, Vulkan, and other backends, and uses the GGUF quantization format to run multi-billion-parameter models on consumer hardware.

Ollama lets you run a model with a single command, and the engine doing the actual work under the hood is llama.cpp. If you’ve used Ollama but found it too opaque — and want more control over quantization formats, GPU layer offloading, speculative decoding, or building directly against an OpenAI-compatible API — llama.cpp is the layer you need to dig into.

This post covers llama.cpp’s core architecture, the logic behind choosing quantization formats, differences across hardware backends, practical usage, and where it stands in 2026.

Core Architecture: GGML + GGUF

llama.cpp was started by Georgi Gerganov in March 2023, originally just to run LLaMA on a MacBook. As of March 2026, it has 100K+ GitHub stars, 1,039+ contributors, and 4,828+ commits. In February 2026, Gerganov and the ggml.ai team officially joined Hugging Face, and the project moved from ggerganov/llama.cpp to ggml-org/llama.cpp.

The core design decision: no Python dependency, no PyTorch dependency, no CUDA dependency — tensor operations are implemented from scratch in pure C/C++.

The foundation is GGML (Georgi Gerganov Machine Learning), an extremely lightweight tensor library:

  • 16 quantized data types
  • Computation graph model (DAG) — define once, execute many times
  • Context-based contiguous memory allocation, cache-friendly
  • Pluggable backend architecture (runtime auto-detection of optimal hardware)
  • Zero runtime dependencies — compiles to a single binary

Models are stored in GGUF (GGML Unified Format), a self-describing binary format:

GGUF File Structure:
┌─────────────────────────┐
│ Magic Number (GGUF)     │
│ Version                 │
│ Metadata (key-value)    │  ← Model architecture, quantization type, tokenizer, context length...
│ Tensor Info             │  ← Name, shape, and offset of each tensor
│ Tensor Data             │  ← Actual quantized weights
└─────────────────────────┘

GGUF replaced the earlier GGML format in August 2023. Key improvements:

  • Single-file, self-contained: metadata, tokenizer, and model weights all in one file
  • Memory-mappable: tensor data can be accessed directly via mmap without loading the entire file into memory, dramatically speeding up load times
  • Mixed quantization: different tensors within the same file can use different quantization formats
  • Supports 50+ model architectures: LLaMA, Mistral, Qwen, Gemma, DeepSeek, Phi, and more

Nearly all community-quantized models on HuggingFace are published in GGUF. It has become the de facto standard format for local LLMs.
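To make the file layout above concrete, here is a minimal Python sketch that reads just the fixed-size header at the start of a GGUF file. It assumes GGUF version 2 or 3 (little-endian, with 64-bit counts); the function name read_gguf_header is made up for this post and is not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes GGUF v2/v3, little-endian)."""
    if len(data) < 24:
        raise ValueError("too short to contain a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    # After the 4-byte magic: uint32 version, uint64 tensor count,
    # uint64 metadata key-value count.
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }


# Usage (reads only the first 24 bytes of the file):
#   with open("model.gguf", "rb") as f:
#       print(read_gguf_header(f.read(24)))
```

The metadata key-value pairs and tensor info records follow directly after these 24 bytes. Parsing them needs the per-type encodings, which this sketch leaves out.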

Quantization Formats: A Complete Breakdown

This is the most important part of llama.cpp to understand. The quantization format determines the trade-off between model size, inference speed, and output quality.

Naming Convention

Q4_K_M
│ │ │
│ │ └── Size: S(mall) / M(edium) / L(arge)
│ └──── K: K-quant, block-wise quantization built on super-blocks with quantized scales
└────── 4: 4 bits per weight

  • Q series (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0): Basic quantization — per-block symmetric or asymmetric quantization with a uniform scale. Q4_0 stores one fp16 scale plus 32 4-bit integers per block of 32 weights; Q4_1 adds an offset (w = d * q + m) for slightly higher precision
  • K series (Q3_K_S, Q4_K_M, Q5_K_M, Q6_K): K-quant — uses different precision for different tensor types — sensitive layers (attention projection, output) get more bits, while FFN layers are compressed more aggressively. The S/M/L suffix controls compression level: S (Small) is more aggressive, M (Medium) is balanced, L (Large) is more conservative
  • IQ series (IQ1_S, IQ2_XXS, IQ3_M, IQ4_NL): Importance-weighted quantization — uses a calibration dataset to compute an importance matrix, combined with lattice codebook vector quantization. Better quality at the same bit width compared to K series, but the quantization process is slower and requires calibration data
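The difference between the symmetric (Q4_0-style) and offset (Q4_1-style) schemes in the Q series is easier to see in code. The sketch below is a simplified illustration in plain Python: it keeps the block size of 32 and the w = d * q + m formula from above, but the rounding and scale selection are my own simplifications, and nothing here reproduces llama.cpp's actual bit packing.

```python
BLOCK_SIZE = 32  # weights per block, as in Q4_0 / Q4_1


def quantize_symmetric(block):
    """Q4_0-style (simplified): one scale d per block, w ~ d * q, q in [-8, 7]."""
    amax = max(abs(w) for w in block)
    d = amax / 7 if amax > 0 else 1.0
    q = [max(-8, min(7, round(w / d))) for w in block]
    return d, q


def dequantize_symmetric(d, q):
    return [d * qi for qi in q]


def quantize_asymmetric(block):
    """Q4_1-style (simplified): scale d plus offset m, w ~ d * q + m, q in [0, 15]."""
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15 if hi > lo else 1.0
    q = [max(0, min(15, round((w - lo) / d))) for w in block]
    return d, lo, q


def dequantize_asymmetric(d, m, q):
    return [d * qi + m for qi in q]
```

When a block's values are not centered on zero, the offset lets all 16 levels cover the actual range, so the step size d shrinks and the rounding error with it. That is the "slightly higher precision" Q4_1 buys with its extra per-block value.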

Format Comparison Table

| Format | Bits/weight | 7B model size | PPL increase (vs. F16) | Use case |
|--------|-------------|---------------|------------------------|----------|
| Q2_K | ~3.2 | ~2.95 GB | High | Extreme memory constraints |
| IQ3_M | ~3.8 | ~3.52 GB | Low | Tight memory but want better quality |
| Q4_0 | ~4.5 | ~3.83 GB | +0.2499 | Legacy baseline, not recommended |
| Q4_K_S | ~4.7 | ~4.36 GB | +0.1149 | Limited memory but want K-quant |
| Q4_K_M | ~4.9 | ~4.58 GB | +0.0535 | Best choice for most people |
| Q5_K_M | ~5.7 | ~5.33 GB | Very low | Quality priority |
| Q6_K | ~6.6 | ~6.14 GB | Near zero | Use this if you have enough memory |
| Q8_0 | ~8.5 | ~7.95 GB | ≈ 0 | Highest quality requirement |
| F16 | 16 | ~14.96 GB | Baseline | Reference only |

Note that Q4_K_M’s PPL increase is only +0.0535, while legacy Q4_0 is +0.2499: at a similar nominal 4-bit width, K-quant cuts the perplexity penalty by roughly 4.7x.

How to choose? If you have enough memory, go with Q6_K. For general use, Q4_K_M. For extreme memory savings, IQ3_M. Another rule of thumb: a larger model at lower quantization usually beats a smaller model at higher quantization — 14B Q4_K_M is typically smarter than 7B Q8_0.
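If you want to sanity-check whether a given model and format will fit, the arithmetic is just parameters times bits per weight. The sketch below is a rough estimator, not a llama.cpp tool: the function names and the reserve_gb default are invented for illustration, and the 7.48e9 parameter count in the usage note is simply back-calculated from the F16 row of the table (14.96 GB at 16 bits per weight).

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    Ignores metadata and the tokenizer, which are small next to the weights.
    """
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float,
         memory_gb: float, reserve_gb: float = 4.0) -> bool:
    """True if the weights fit after setting aside reserve_gb.

    reserve_gb is an illustrative allowance for KV cache and the OS,
    not a llama.cpp setting.
    """
    return estimate_size_gb(n_params, bits_per_weight) <= memory_gb - reserve_gb


# Usage:
#   estimate_size_gb(7.48e9, 4.9)   # about 4.58, in line with the Q4_K_M row
#   fits(70e9, 4.9, memory_gb=24)   # a 70B model at ~4.9 bpw on a 24 GB card
```

Real files deviate a little because K-quant and IQ formats mix bit widths across tensors, so treat the result as a ballpark figure.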

IQ Formats: The Extreme Compression Option

IQ (importance quantization) formats have two core innovations:

  1. Importance matrix (imatrix): Runs a calibration dataset to determine which weights have the greatest impact on output, and prioritizes preserving those weights during compression
  2. Lattice codebook: Uses optimized vector quantization codebooks instead of simple linear quantization, providing greater expressiveness at the same bit width
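
| Format | Bits/weight | 7B model size | Notes |
|--------|-------------|---------------|-------|
| IQ1_S | ~1.87 | ~1.87 GB | Extreme compression, significant quality loss |
| IQ2_XXS | ~2.4 | ~2.23 GB | Ultra-low bit, requires imatrix |
| IQ3_M | ~3.8 | ~3.52 GB | Better quality than Q3_K at the same bit width |
| IQ4_XS | ~4.3 | (not listed) | Importance-weighted 4-bit |
| IQ4_NL | ~4.5 | (not listed) | Non-linear 4-bit quantization |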

There are also TQ (Ternary Quantization) formats: TQ1_0 (1.69 bpw) and TQ2_0 (2.06 bpw), which represent weights using only three values: -1/0/+1. These are the most extreme compression methods available.

IQ quantization first needs an importance matrix, which llama-imatrix generates from calibration text. Skipping this step makes llama-quantize emit a warning. Decoding is slightly slower than with the K series, and support is currently best on CPU and Metal.
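The importance-matrix idea can be sketched in a few lines: instead of choosing a block's scale to minimize plain rounding error, choose it to minimize error weighted by how much each weight matters. The code below is only a toy version of that one idea, under my own assumptions (a brute-force search over candidate scales, a 4-bit symmetric grid). It does not implement lattice codebooks and is not how llama.cpp's quantizer is actually written.

```python
def quantize_with_scale(block, d, qmin=-8, qmax=7):
    """Round each weight to the nearest grid point for scale d."""
    return [max(qmin, min(qmax, round(w / d))) for w in block]


def weighted_error(block, importance, d, q):
    """Squared reconstruction error, weighted per weight by its importance."""
    return sum(imp * (w - d * qi) ** 2
               for w, imp, qi in zip(block, importance, q))


def best_scale(block, importance, n_candidates=32):
    """Search scales around the naive one; keep the lowest weighted error."""
    amax = max(abs(w) for w in block)
    if amax == 0:
        return 1.0
    naive = amax / 7
    candidates = [naive * (0.7 + 0.6 * i / (n_candidates - 1))
                  for i in range(n_candidates)]
    candidates.append(naive)  # never do worse than the naive choice
    return min(
        candidates,
        key=lambda d: weighted_error(block, importance, d,
                                     quantize_with_scale(block, d)),
    )
```

With uniform importance this reduces to an ordinary error-minimizing scale search. With a skewed importance vector, the search is free to sacrifice precision on low-importance weights to represent the high-importance ones more accurately.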

Hardware Backends

Cross-platform capability is llama.cpp’s greatest competitive advantage.

Apple Metal

The go-to backend for Mac users. Apple Silicon’s unified memory architecture lets CPU and GPU share the same memory pool — no extra data transfers needed.

cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

-ngl 99 offloads all layers to the GPU. The Metal backend is very efficient on M-series chips because there’s no PCIe bandwidth bottleneck like with NVIDIA.

NVIDIA CUDA

The highest-performance backend, especially on high-end consumer cards (RTX 4090, 5090).

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Supports Flash Attention (-fa flag) and tensor parallelism across multiple GPUs. VRAM is a hard limit — anything that exceeds it falls back to CPU, and speed drops off a cliff.

Vulkan

A cross-platform GPU backend supporting NVIDIA, AMD, and Intel GPUs. Performance is slightly lower than native CUDA/Metal (roughly 80-90%), but wins on universality.

cmake -B build -DGGML_VULKAN=ON

For AMD users who find ROCm too much hassle, Vulkan is a simpler alternative.

CPU

You can run without a GPU — it’s just slower. Supports AVX2/AVX-512 (x86) and ARM NEON (Apple Silicon, Raspberry Pi). IQ formats have special CPU optimizations and can be faster than Q formats in certain cases.

The Key to Performance: Memory Bandwidth

LLM text generation is a memory bandwidth bottleneck, not a compute bottleneck — the GPU spends most of its time waiting for data to arrive from memory, not doing math. This explains a counterintuitive phenomenon:

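| Hardware | Memory bandwidth | 8B Q4_K_M decode | Notes |
|----------|------------------|------------------|-------|
| M3 Pro 36GB | 150 GB/s | ~20 tok/s | Metal |
| M2 Pro 32GB | 200 GB/s | ~38-48 tok/s | Higher bandwidth than M3 Pro! |
| M4 Max 128GB | 546 GB/s | ~70-80 tok/s | Metal |
| RTX 4090 24GB | ~1 TB/s | ~120-150 tok/s | CUDA |
| RTX 3060 12GB | 360 GB/s | ~35 tok/s | CUDA |
| i9-13900K (CPU) | ~90 GB/s | ~12 tok/s | AVX2 |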

The M2 Pro has higher memory bandwidth than the M3 Pro (200 vs 150 GB/s), so it’s actually faster at decoding. The RTX 4090 is unbeatable when the model fits entirely in VRAM, but a 70B model exceeds 24GB even at 4-bit, so part of it has to run from system RAM on the CPU and speed drops off a cliff. At that point, a 128GB M4 Max is actually faster because the whole model stays in unified memory and no data transfers are needed.

Apple Silicon’s advantage is capacity; NVIDIA’s advantage is bandwidth.
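The bandwidth argument turns into a quick back-of-the-envelope estimate: if every generated token requires streaming all the weights through memory once, decode speed is roughly bounded by bandwidth divided by model size. The function below is a sketch of that estimate only. The 4.9 GB figure in the usage note is my own assumption for an 8B model at about 4.9 bits per weight, not a measured file size.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough decode-speed estimate in tokens/s.

    Assumes each token reads every weight exactly once and that nothing
    else (KV cache reads, compute, overhead) costs any time.
    """
    return bandwidth_gb_s / model_size_gb


# Usage, assuming ~4.9 GB of weights for an 8B Q4_K_M model:
#   decode_ceiling_tok_s(150, 4.9)   # M3 Pro
#   decode_ceiling_tok_s(200, 4.9)   # M2 Pro
```

For the M3 Pro this gives about 30 tok/s against a measured ~20, so treat the result as an order-of-magnitude guide rather than a prediction. What it does capture is the ranking: more bandwidth, faster decoding.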

CLI Tools

After compilation, llama.cpp produces several key binaries:

llama-cli: Interactive Inference

# Basic usage
llama-cli -m model.gguf -p "Explain quantum computing" -n 256

# Full parameter example
#   -ngl     GPU offload layers
#   -c       context length
#   -t       CPU threads
#   -fa      Flash Attention
#   --temp   temperature
#   --top-p  nucleus sampling
llama-cli \
  -m llama-3.1-8b-q4_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  -t 8 \
  -fa on \
  --temp 0.7 \
  --top-p 0.9 \
  -p "Write a haiku about coding"

Add -i for interactive mode, or --conversation for multi-turn chat.

llama-server: OpenAI-Compatible API

llama-server \
  -m model.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080 \
  --host 0.0.0.0 \
  -np 4              # 4 parallel slots

Once started, it provides:

  • POST /v1/chat/completions — Chat (OpenAI-compatible)
  • POST /v1/completions — Text generation
  • POST /v1/embeddings — Embeddings
  • POST /v1/messages — Anthropic Messages API compatible
  • POST /reranking — Document reranking
  • GET /health — Health check
  • GET /metrics — Prometheus metrics
  • Built-in Web UI (open http://localhost:8080 in a browser)

Supports function calling, structured output (JSON Schema), vision input, reasoning/thinking mode, dynamic LoRA loading, SSL/TLS, and API key authentication. Any tool that supports the OpenAI or Anthropic API can connect directly.
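As a quick illustration of talking to the chat endpoint without any SDK, here is a small client using only the Python standard library. It assumes the server from the command above is listening on localhost:8080, that it accepts requests without a model field (since it serves the model it was started with), and that the response follows the usual OpenAI shape. The helper names are made up for this post.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Usage (requires a running llama-server):
#   print(chat("What is KV cache?"))
```

If you started the server with API key authentication, you would also need to send an Authorization header. That part is left out here.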

llama-quantize: Model Quantization

# Quantize an fp16 model to Q4_K_M
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# IQ formats require calibration data
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq3_m.gguf IQ3_M

Calibration data for IQ formats can be generated with the llama-imatrix tool — feed it a representative text corpus to compute each weight’s importance.

llama-bench: Performance Benchmarking

llama-bench -m model.gguf -ngl 99 -t 8

Outputs prompt processing (prefill) and token generation (decode) speeds, making it easy to compare different quantization formats and hardware configurations.

Advanced Features

Speculative Decoding

LLM decoding is memory bandwidth-bound — processing N tokens at once takes roughly the same time as processing 1. Speculative decoding exploits this: it cheaply and quickly guesses multiple candidate tokens, then has the large model verify them all at once.
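The guess-then-verify loop is easier to follow with toy models. In the sketch below, target_next and draft_next are stand-ins I made up: plain functions that map a token list to the next token, with greedy decoding only. A real engine verifies all draft tokens in one batched forward pass; here that step is an ordinary loop, so this shows the logic rather than the speedup.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding with toy next-token functions.

    target_next / draft_next: callables taking a token list and returning
    the next token. Returns prompt plus exactly n_new generated tokens.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. Draft: the cheap model guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify: keep draft tokens while they match what the target
        #    model would have produced at that position.
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # take the target's token, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target adds one more for free.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal]
```

Every token that ends up in the output is one the target model would have chosen itself, so the result matches plain greedy decoding with the target. The draft model only changes how many tokens each verification step yields.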

llama.cpp supports several speculation methods:

| Method | Principle | Speedup |
|--------|-----------|---------|
| Draft model | A small model (e.g., 1B) generates a draft, the large model verifies | 1.8-2x, up to 3x |
| ngram-simple | Matches n-grams in the already-generated token history and uses subsequent tokens as drafts | Low overhead |
| ngram-mod | Hash-based n-gram statistics, shared across server slots, ~16MB memory | Low overhead |

# Draft model approach
llama-server \
  -m large-model.gguf \
  -md draft-model.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 \
  --spec-type draft

# N-gram approach (no additional model needed)
llama-server \
  -m model.gguf -ngl 99 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 12

The best draft model pairing is a large and a small model from the same family (e.g., Llama 3.1 8B + Llama 3.2 1B), since the vocabularies must match. 0.5-1B draft models work best; going too large is counterproductive. Code generation benefits the most from speculative decoding due to its repetitive patterns.
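The n-gram approach needs no second model because the draft comes from the text itself. Here is a minimal sketch of the idea behind the simple variant: find the most recent earlier occurrence of the last n tokens and propose whatever followed it. The function name and defaults are my own, and this illustrates the principle rather than llama.cpp's implementation.

```python
def ngram_draft(history, n=3, max_draft=8):
    """Propose draft tokens by n-gram lookup in the token history.

    Finds the most recent earlier occurrence of the last n tokens and
    returns up to max_draft tokens that followed it. Returns [] when
    there is no earlier match.
    """
    if len(history) <= n:
        return []
    key = history[-n:]
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(history) - n - 1, -1, -1):
        if history[start:start + n] == key:
            return history[start + n:start + n + max_draft]
    return []


# Usage: the proposed tokens are then verified by the main model,
# exactly like tokens from a draft model.
#   ngram_draft(token_ids, n=3)
```

This is why repetitive output such as code benefits most: the more often recent text repeats earlier text, the more draft tokens survive verification.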

Grammar-Constrained Sampling

Use a GBNF grammar (llama.cpp’s BNF-style grammar format) to constrain the model’s output format, with a 100% structural guarantee:

llama-cli -m model.gguf --grammar 'root ::= "{" "\"name\":" [^}]+ "}"' -p "Generate a JSON"

Stronger than JSON mode — you can define arbitrary grammar rules to ensure output is valid JSON, SQL, or any custom format.
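The structural guarantee comes from filtering at sampling time: at each step, any token that would make the output an invalid prefix under the grammar is simply not eligible. The toy below shows that mechanism with a hand-written prefix check instead of a real GBNF parser. All names are hypothetical, and scores_fn stands in for the model's logits.

```python
def constrained_greedy(scores_fn, vocab, is_valid_prefix, is_complete,
                       max_tokens=32):
    """Greedy decoding restricted to tokens that keep the output valid.

    scores_fn(text) -> {token: score}, standing in for model logits.
    is_valid_prefix(text) -> True if text can still grow into a valid output.
    is_complete(text) -> True once text is a full valid output.
    """
    text = ""
    for _ in range(max_tokens):
        scores = scores_fn(text)
        # Mask: drop every token that would break the grammar.
        allowed = [t for t in vocab if is_valid_prefix(text + t)]
        if not allowed:
            break
        text += max(allowed, key=lambda t: scores.get(t, float("-inf")))
        if is_complete(text):
            break
    return text


# Toy grammar: root ::= "yes" | "no"
ANSWERS = ("yes", "no")
VOCAB = ["y", "e", "s", "n", "o", "maybe"]


def valid_prefix(text):
    return any(a.startswith(text) for a in ANSWERS)


def complete(text):
    return text in ANSWERS
```

Even if the model's favorite token is "maybe", it is masked out because no grammar rule starts with it, and decoding continues with the best token that is allowed.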

Flash Attention

Flash Attention is enabled by default in llama.cpp. Use -fa off to disable it if needed.

Real-world benchmarks (M3 Max, Llama 3 8B, ~26K tokens):

  • Without FA: first token 80s, generation 11 tok/s
  • With FA: first token 72s, generation 32 tok/s — ~3x generation speed

In this benchmark the gain shows up mostly in generation speed at long context, while time to first token improves by only about 10% (80s to 72s). The longer the context, the more pronounced the effect. A few models may show quality differences on specific GPUs, but llama.cpp automatically falls back for unsupported cases.

Multimodal Inference

libmtmd, added in April 2025, unified multimodal support. It supports vision-language models like LLaVA, Gemma 3, Qwen2-VL, and MobileVLM, and has begun supporting audio input:

llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf -p "Describe this image" --image photo.jpg

llama-server also has built-in multimodal support and can receive images via the OpenAI-compatible API.

Getting Models from HuggingFace

The simplest approach is to use llama.cpp’s built-in HuggingFace integration — one command does it all:

# Download and run directly from HuggingFace
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF --conversation

Or manually download community-quantized GGUFs:

pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

If you want to convert from safetensors yourself:

# Convert to fp16 GGUF
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf

# Then quantize
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Search for the GGUF tag on HuggingFace to find a large selection of pre-quantized models. Common quantizers include bartowski, TheBloke, and mradermacher.

Python Bindings

If you’d rather not use the CLI, llama-cpp-python provides a Python interface:

from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=-1,    # Offload everything to GPU
    n_ctx=8192,
    flash_attn=True,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is KV cache?"}],
    temperature=0.7,
)
print(output["choices"][0]["message"]["content"])

It also provides an OpenAI-compatible server:

pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model model.gguf --n_gpu_layers -1

Comparison with Other Inference Engines

| | llama.cpp | MLX | vLLM | ExLlamaV2 |
|---|-----------|-----|------|-----------|
| Language | C/C++ | Python (Apple) | Python | Python/CUDA |
| Hardware | CPU, Metal, CUDA, Vulkan, ROCm | Apple Silicon only | Primarily GPU (NVIDIA/AMD/Intel) | NVIDIA only |
| Quant Format | GGUF (Q/K/IQ) | MLX 4-bit | AWQ, GPTQ, fp8 | EXL2 |
| Positioning | Cross-platform local inference | Apple ecosystem optimized | Production-grade high throughput | Maximum speed on NVIDIA |
| Server Mode | Yes (OpenAI-compatible) | Community solutions | Native | Yes (tabbyAPI) |
| Model Source | HuggingFace GGUF | HuggingFace MLX | HuggingFace original | HuggingFace EXL2 |

Key decision points:

  • vs Ollama: Ollama’s underlying engine is llama.cpp, wrapped in a Go layer. Using llama.cpp directly is 13-80% faster than Ollama (depending on the scenario), but Ollama offers a better developer experience
  • Mac users: Both llama.cpp (Metal) and MLX are solid choices. MLX is 30-50% faster on Apple Silicon, but llama.cpp has a larger ecosystem and model selection
  • NVIDIA users: For production, use vLLM (35x higher multi-user throughput); for local use, llama.cpp or ExLlamaV2
  • Cross-platform: Of the engines compared here, llama.cpp is the only one that supports CPU/Metal/CUDA/Vulkan/ROCm across the board

Mac Quick Start

# 1. Build (Metal is auto-enabled)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build    # Metal is enabled by default on macOS
cmake --build build --config Release -j

# Or use Homebrew
brew install llama.cpp

# 2. Download a model
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

# 3. Run it
llama-cli \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 8192 -fa on \
  --conversation

# 4. Or start the server
llama-server \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 8192 -fa on \
  --port 8080

Recommendations for M3 Pro 36GB: Use Q6_K (~6.6 GB) for 8B models, Q4_K_M (~8.5 GB) for 14B, and Q4_K_M (~20 GB) for 32B. Reserve the remaining memory for KV cache and the system.

The Big Picture

llama.cpp is the infrastructure layer of local LLM inference. Ollama, LM Studio, and GPT4All all run on top of it. The benefit of using llama.cpp directly is full control over quantization format selection, GPU offload strategy, context length allocation, and sampling parameters — all of which are automated in Ollama, which is convenient but opaque.

In 2026, llama.cpp is far more than “running LLaMA on a CPU.” It’s a mature multi-backend inference engine supporting 50+ model architectures with 100K+ GitHub stars. Development has accelerated since the ggml.ai team joined Hugging Face — MCP client support, Anthropic API compatibility, audio input, and autoparser structured output are all recent additions.

If you’re already using Ollama and satisfied with it, there’s no need to switch. But if you want a deeper understanding of every aspect of local inference, or need features Ollama doesn’t support (speculative decoding, grammar sampling, custom quantization), llama.cpp is the next step.
