- Why KV Cache Is the Bottleneck
- Stage 1: PolarQuant — Rotation + Scalar Quantization
- Stage 2: QJL — 1-bit Residual Compression
- Asymmetric K/V Strategy: The Most Important Design Decision
- Three Key Research Findings
- Real-World Performance Data
- How to Use on Mac
- Comparison with Other KV Cache Optimizations
- Overall Takeaway
The memory bottleneck of LLM inference isn’t just the model itself: as contexts get longer, the KV cache becomes the real memory killer. A 70B model at 128K context can easily consume over 40GB for the KV cache alone. TurboQuant+ compresses this memory directly, achieving 3.8-6.4x compression through two-stage quantization, with virtually no quality loss at the 4-bit setting.
Why KV Cache Is the Bottleneck
During Transformer inference, the attention computation for each token needs to access Key and Value vectors of all previous tokens. These vectors are stored in fp16, and memory consumption grows linearly with context length:
KV cache size ≈ 2 (K and V) × num_layers × num_kv_heads × head_dim × context_length × 2 bytes (fp16)
Model weights are fixed, but the KV cache keeps growing as the conversation gets longer. This is why your MacBook can load a 32B model but OOMs during long conversations — it’s not that the model is too large, it’s the cache eating all the memory.
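To make the formula concrete, here is a minimal Python sketch. The configuration used (80 layers, 8 KV heads, head_dim 128) is an assumed Llama-style 70B layout with grouped-query attention, not a figure taken from TurboQuant+, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence (fp16 by default)."""
    # Leading 2 = one Key tensor plus one Value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_elem

# Assumed 70B-class layout with grouped-query attention (illustrative only).
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_length=128 * 1024)
print(f"{size / 2**30:.1f} GiB")  # 40.0 GiB
```

Because every factor is multiplicative, doubling the context length doubles the cache, while the model weights stay the same size.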
vLLM’s PagedAttention solves the memory fragmentation problem of KV cache. TurboQuant takes a different angle: directly shrinking each vector.
Stage 1: PolarQuant — Rotation + Scalar Quantization
The core observation of PolarQuant: once a high-dimensional vector is normalized to the unit sphere and randomly rotated, its coordinates approximately follow a Gaussian distribution, so they can be compressed efficiently with an optimal scalar quantizer.
The process:
- Extract L2 norm, normalizing the vector to a unit vector
- Walsh-Hadamard random rotation (O(d log d)) to evenly distribute information across all dimensions — avoiding some coordinates being extremely large while others are near zero
- Optimal scalar quantization: use Lloyd’s algorithm to pre-compute the optimal codebook for the Gaussian distribution, quantizing each coordinate
- Store quantization indices + norm
The rotation step is critical. Without rotation, outlier dimensions severely degrade quantization quality. The Walsh-Hadamard transform is cheaper to apply than a dense random rotation matrix (O(d log d) vs O(d²)), and because it is orthogonal it preserves norms and distances.
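The four steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the vector length must be a power of two, the random rotation is modeled as random sign flips followed by a Walsh-Hadamard transform, and the codebook is fitted with Lloyd's algorithm on Gaussian samples rather than derived from the exact density.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.array(x, dtype=np.float64)          # contiguous working copy
    n = x.shape[0]
    h = 1
    while h < n:
        blocks = x.reshape(-1, 2, h)            # view: pairs of half-blocks of width h
        top, bottom = blocks[:, 0, :].copy(), blocks[:, 1, :].copy()
        blocks[:, 0, :] = top + bottom          # butterfly, written back into x
        blocks[:, 1, :] = top - bottom
        h *= 2
    return x / np.sqrt(n)

def gaussian_codebook(bits, n_samples=100_000, iters=40, seed=0):
    """Lloyd's algorithm on N(0, 1) samples; returns 2**bits sorted centroids."""
    samples = np.random.default_rng(seed).standard_normal(n_samples)
    levels = 2 ** bits
    centers = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centers[:-1] + centers[1:]) / 2        # nearest-centroid boundaries
        cell = np.searchsorted(edges, samples)          # cell index of every sample
        centers = np.array([samples[cell == k].mean() for k in range(levels)])
    return centers

def polar_encode(v, signs, codebook):
    """Steps 1-4: normalize, rotate, quantize, return (indices, norm)."""
    d = v.shape[0]
    norm = np.linalg.norm(v)
    rotated = fwht(signs * (v / norm)) * np.sqrt(d)     # coordinates now roughly N(0, 1)
    edges = (codebook[:-1] + codebook[1:]) / 2
    return np.searchsorted(edges, rotated).astype(np.uint8), norm

def polar_decode(indices, norm, signs, codebook):
    """Look up centroids, undo the rotation, restore the norm."""
    d = indices.shape[0]
    rotated = codebook[indices] / np.sqrt(d)
    return norm * signs * fwht(rotated)                 # orthonormal WHT is its own inverse

rng = np.random.default_rng(1)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)                 # random sign flips, fixed once
v = rng.standard_normal(d)
v[3] = 25.0                                             # an outlier coordinate
for bits in (1, 2, 3):
    cb = gaussian_codebook(bits)
    idx, norm = polar_encode(v, signs, cb)
    rel_err = np.linalg.norm(v - polar_decode(idx, norm, signs, cb)) / norm
    print(f"{bits}-bit relative reconstruction error: {rel_err:.3f}")
```

The codebook depends only on the bit width, so it can be computed once and reused for every vector; only the indices and one norm are stored per vector.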
Stage 2: QJL — 1-bit Residual Compression
PolarQuant reconstruction inevitably has residuals. QJL (Quantized Johnson-Lindenstrauss) uses extreme compression to preserve residual information:
- Compute residual = original vector - Stage 1 reconstruction
- Store residual L2 norm
- Multiply by a random projection matrix, take the sign → only 1 bit per dimension
The Johnson-Lindenstrauss lemma guarantees: after random projection, the inner product relationships between vectors are preserved with high probability in high dimensions. 1-bit is coarse, but when dimensions are high enough (LLM head_dim is typically >= 64), it’s statistically sufficient to compensate for Stage 1’s quantization error.
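A small sketch of the idea, assuming a dense Gaussian projection matrix and hypothetical function names (the project's actual projection and storage layout may differ). It relies on a standard identity for a Gaussian row s: E[sign(s·r) · (s·q)] = sqrt(2/π) · ⟨q, r⟩ / ‖r‖, which is what lets sign bits plus one norm recover an unbiased inner-product estimate.

```python
import numpy as np

def qjl_encode(residual, proj):
    """Keep one sign bit per projected coordinate plus the residual's L2 norm."""
    bits = np.where(proj @ residual >= 0, 1, -1).astype(np.int8)
    return bits, np.linalg.norm(residual)

def qjl_inner_product(query, bits, res_norm, proj):
    """Unbiased estimate of <query, residual> computed from the sign bits."""
    m = proj.shape[0]
    # Rescale by sqrt(pi/2) * ||r|| to undo the sqrt(2/pi) / ||r|| factor
    # introduced by keeping only the signs.
    return res_norm * np.sqrt(np.pi / 2) / m * float((proj @ query) @ bits)

rng = np.random.default_rng(0)
d = 128
proj = rng.standard_normal((d, d))          # d projections -> 1 bit per dimension
residual = 0.1 * rng.standard_normal(d)     # stand-in for a Stage 1 residual
query = rng.standard_normal(d)

bits, res_norm = qjl_encode(residual, proj)
print("true inner product:", float(query @ residual))
print("QJL estimate:      ", qjl_inner_product(query, bits, res_norm, proj))
```

A single estimate at d = 128 is noisy; the point is that the error has zero mean and the residual itself is already small, so the correction added to the Stage 1 score improves it on average.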
The combined compression of both stages:
| Format | Stage 1 | Stage 2 | Total bits/element | Compression ratio (vs fp16) |
|---|---|---|---|---|
| turbo2 | 1-bit | 1-bit | 2 | 6.4x |
| turbo3 | 2-bit | 1-bit | 3 | 4.6-5.1x |
| turbo4 | 3-bit | 1-bit | 4 | 3.8x |
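The ratios in the table are lower than the nominal 16 / bits (8x, 5.3x, 4x), which is consistent with each vector also carrying the norms stored in both stages. The following back-calculation uses only the numbers from the table; the reading of the gap as metadata overhead is an assumption, and `effective_bits` is a hypothetical helper.

```python
FP16_BITS = 16

# Compression ratios copied from the table (turbo3 is given as a range).
ratios = {"turbo2": (6.4,), "turbo3": (4.6, 5.1), "turbo4": (3.8,)}
nominal_bits = {"turbo2": 2, "turbo3": 3, "turbo4": 4}

def effective_bits(ratio):
    """Bits per element implied by a compression ratio relative to fp16."""
    return FP16_BITS / ratio

for name, rs in ratios.items():
    implied = ", ".join(f"{effective_bits(r):.2f}" for r in rs)
    print(f"{name}: nominal {nominal_bits[name]} bits, implied {implied} bits/element")
```

The implied cost lands roughly 0.1 to 0.5 bits per element above the nominal width in every row.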
Asymmetric K/V Strategy: The Most Important Design Decision
TurboQuant+ uses different quantization strategies for the Key cache and Value cache:
- K cache → TurboQuant (two-stage, PolarQuant + QJL)
- V cache → TurboQuantMSE (PolarQuant only, no QJL)
The reason lies in different mathematical properties:
- Attention score computation is Q · K^T (an inner product), which requires preserving angular relationships between vectors → QJL’s JL property is designed to preserve inner products
- The attention-weighted sum is softmax(scores) · V (a linear combination), which only needs MSE minimization → PolarQuant alone is sufficient
This asymmetric design stems from the project’s three key research findings.
Three Key Research Findings
V Compression Is Nearly Free
Compressing the Value cache to 2-bit barely affects attention output quality — provided Key precision is maintained. This finding is counterintuitive, but experimental data is highly consistent: V’s quantization error gets diluted by the softmax weighted average.
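The dilution effect can be illustrated with a toy single-query attention in NumPy. This is a sketch under simplifying assumptions: random Gaussian Q, K and V, and independent Gaussian noise standing in for V quantization error (real quantization error is not exactly independent noise).

```python
import numpy as np

def attention_output(q, K, V):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d = 512, 128
q = rng.standard_normal(d)
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))
noise = 0.1 * rng.standard_normal((n_tokens, d))    # stand-in for V quantization error

clean, weights = attention_output(q, K, V)
noisy, _ = attention_output(q, K, V + noise)

per_token_err = np.linalg.norm(noise, axis=1)        # error stored on each V row
output_err = np.linalg.norm(noisy - clean)           # error that reaches the output
print("mean per-token V error:", per_token_err.mean())
print("attention output error:", output_err)
print("sqrt(sum of w^2):      ", np.sqrt((weights ** 2).sum()))
```

Since the output error is the weighted sum of the per-row errors, it can never exceed the largest per-row error, and for independent noise it shrinks roughly by the factor sqrt(Σ wᵢ²). The benefit disappears when attention concentrates on a single token, because that factor then approaches 1.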
K Compression Is the Sole Source of Quality Degradation
All measurable quality degradation comes from Key cache compression. This explains why asymmetric configurations (high-precision K + low-precision V) can significantly recover quality. turbo4, which uses 3+1 bits for K and only 3 bits for V, achieves near-lossless performance.
Boundary Layer Sensitivity
The first two and the last two layers of a Transformer are particularly sensitive to quantization. Protecting these layers (no compression or higher precision) can recover 37-91% of the quality gap. The cost is small: a 32-layer model only needs to protect 4 layers, increasing memory by less than 15%.
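The memory cost of protecting boundary layers depends on which precision those layers fall back to, which the text does not specify. The sketch below compares two assumed choices for a 32-layer model with turbo4 elsewhere: q8_0 (8.5 bits per element in llama.cpp, from 32 int8 values plus one fp16 scale per block) and plain fp16. The turbo4 cost is back-calculated from the 3.8x ratio, and `kv_bits_total` is a hypothetical helper.

```python
def kv_bits_total(num_layers, base_bits, protected_layers=0, protected_bits=16.0):
    """Relative KV cache size: bits per element summed over all layers."""
    return (num_layers - protected_layers) * base_bits + protected_layers * protected_bits

TURBO4_BITS = 16 / 3.8   # assumption: effective bits implied by a 3.8x ratio vs fp16
Q8_0_BITS = 8.5          # llama.cpp q8_0: 32 int8 values + one fp16 scale per block

baseline = kv_bits_total(32, TURBO4_BITS)
for label, bits in (("q8_0", Q8_0_BITS), ("fp16", 16.0)):
    protected = kv_bits_total(32, TURBO4_BITS, protected_layers=4, protected_bits=bits)
    print(f"4 boundary layers at {label}: +{(protected / baseline - 1) * 100:.1f}% memory")
```

Under these assumptions the q8_0 fallback adds about 13%, which fits the "less than 15%" figure, while keeping the four layers in fp16 would add about 35%.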
Real-World Performance Data
turbo4 benchmarks on Apple M5 Max:
| Metric | turbo4 | q8_0 (baseline) |
|---|---|---|
| Perplexity | 6.125 | 6.111 |
| Prefill speed | ~2747 tokens/sec | — |
| Decode speed | ~0.9x baseline | 1x |
| Memory compression | 3.8x | 1x |
Perplexity differs by only 0.014 (about 0.2%), which is virtually imperceptible. The gap only becomes noticeable at turbo3 and turbo2.
The community has validated results on M1-M5 Mac, NVIDIA RTX 3080 Ti / 3090 / 4090 / 5090, and AMD RX 9070 XT.
How to Use on Mac
Current Status
TurboQuant+’s llama.cpp integration is being pushed upstream via PR. There are currently two ways to try it:
Option 1: Python Prototype (Principle Verification)
git clone https://github.com/TheTom/turboquant_plus
cd turboquant_plus
pip install .
# Run demo to see compression results
python benchmarks/demo.py
# Full benchmark
python benchmarks/run_benchmark.py
Requirements: Python 3.10+, NumPy, SciPy. This is numerical verification, not actual LLM inference.
Option 2: llama.cpp + Metal (Actual Inference)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
-DGGML_METAL=ON enables Apple Metal GPU acceleration at build time. When running a model, passing -ngl 99 offloads all layers to the GPU.
Mac Hardware Reference
| Mac Memory | Recommended Model | Quantization Format | Expected Performance |
|---|---|---|---|
| 16 GB | 7-8B | Q4_K_M | ~30-40 t/s |
| 36 GB | 14-32B | Q4_K_M | ~8-25 t/s |
| 64 GB | 70B | Q4_K_M | ~5-8 t/s |
| 96-128 GB | 104B @ 128K ctx | turbo4 | Verified feasible |
TurboQuant’s sweet spot is 32GB+ with long context scenarios — the longer the context, the higher the KV cache proportion, and the greater the compression benefit. For short conversations, traditional quantization is sufficient.
Comparison with Other KV Cache Optimizations
| Method | Strategy | Compression Ratio | Quality Impact |
|---|---|---|---|
| PagedAttention (vLLM) | Eliminate memory fragmentation | ~1.25x | Zero |
| KV cache eviction | Drop unimportant tokens | Variable | Lossy |
| GQA / MQA | Share KV heads | 4-8x | Decided at training time |
| KIVI / KVQuant | Per-channel quantization | 2-4x | Low |
| TurboQuant | Rotation + optimal quantization + JL residual | 3.8-6.4x | Very low |
TurboQuant’s advantage: high compression ratio, no retraining required, and can be stacked with PagedAttention.
Overall Takeaway
TurboQuant+ is currently the open-source KV cache quantization scheme with the highest compression ratio and the most solid theoretical foundation. The two-stage design (PolarQuant handles the bulk of the signal, QJL compensates for the residual) is mathematically elegant and practical to implement.
For Mac users, once the llama.cpp upstream merge lands, this will be a killer optimization for long contexts: the same memory can hold 3-4x the context length. A 36GB M3 Pro with turbo4 has the potential to run 32K or even longer contexts stably on a 32B model.