
DeepSeek-OCR: The 10x Compression Experiment That Turns Long Context into Images

May 9, 2026 1 min
TL;DR: The DeepSeek-OCR paper is titled "Contexts Optical Compression". OCR is just the means; what it actually validates is that rendering text as images and feeding them to a VLM achieves roughly 10x compression at about 97% decoding accuracy. If that result holds beyond OCR, it could substantially change token costs for long-context LLMs and RAG.

October 2025 was an OCR explosion — over six open-source OCR/VLM models released in a single month. But DeepSeek-OCR is not just another one riding the wave. Its paper is titled Contexts Optical Compression: OCR is the means; context compression is the goal. If this approach works out, the cost of long context in future LLMs may no longer be measured in token count, but in pixel count.

What It’s Betting On

The core bottleneck when current LLMs process long text is the O(n²) cost of self-attention. 10,000 tokens means on the order of 10,000² = 100 million pairwise attention scores for each attention head in every layer, regardless of how much information those tokens actually carry.

DeepSeek’s bet: a 1024x1024 document image can be represented with 10x fewer vision tokens than text tokens, while still achieving 97% accuracy when decoded back to the original text.

Key experimental results from the paper:

| Compression ratio (text tokens / vision tokens) | OCR accuracy |
| --- | --- |
| < 10x | 97% |
| 20x | ~60% |

In other words, 1 vision token can carry roughly the information of 10 text tokens, nearly losslessly. If this conclusion extends to general text (not just OCR), it effectively creates a compression channel for LLM context windows.
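
As a back-of-the-envelope illustration, the sketch below computes the compression ratio as defined in the table (text tokens divided by vision tokens) and the matching drop in pairwise attention scores. The function names and the 1,000-token figure are hypothetical, and the quadratic cost model ignores constants, layers, and heads.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens divided by vision tokens, as defined in the table."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def attention_pairs(n_tokens: int) -> int:
    """Pairwise attention scores for one full self-attention pass: n^2."""
    return n_tokens * n_tokens


text_tokens = 10_000
vision_tokens = 1_000  # hypothetical: the same content compressed 10x

ratio = compression_ratio(text_tokens, vision_tokens)
savings = attention_pairs(text_tokens) / attention_pairs(vision_tokens)
print(f"compression ratio: {ratio:.0f}x")                   # 10x
print(f"pairwise attention scores reduced: {savings:.0f}x")  # 100x
```

Because attention cost grows with the square of sequence length, a 10x reduction in tokens translates into roughly a 100x reduction in pairwise scores under this simplified model.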

Architecture: Two-Stage

Input image (e.g. 1024×1024)
  ↓
DeepEncoder (380M)
  ├── SAM: local visual features
  ├── 16× compressor
  └── CLIP: global semantic features → few vision tokens (64~1853)
  ↓
DeepSeek3B-MoE-A570M (3B params, 570M activated per token)
  ↓
Text output (plain text / Markdown / HTML tables / grounding coordinates)

Key design points:

  • DeepEncoder keeps activation memory low even under high-resolution input, which is the key to throughput
  • MoE decoder has 3B total parameters but only activates 570M, so inference cost is close to a small model
  • Chaining SAM (local perception) and CLIP (global semantics) gives it stronger visual grounding capability than typical OCR models
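
The token counts in the diagram can be roughly reproduced with a little arithmetic. The sketch below assumes a 16×16-pixel patch size in the SAM stage (an assumption, not stated above) followed by the 16× compressor, and also computes the share of decoder parameters that is active per token. The function names are illustrative only.

```python
PATCH_SIZE = 16    # assumption: 16x16-pixel patches in the SAM stage
COMPRESSION = 16   # the 16x compressor from the diagram


def vision_tokens(image_size: int) -> int:
    """Vision tokens for a square image after patching and 16x compression."""
    if image_size % PATCH_SIZE != 0:
        raise ValueError("image_size must be a multiple of the patch size")
    patches = (image_size // PATCH_SIZE) ** 2
    return patches // COMPRESSION


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of MoE decoder parameters used for each token."""
    return active_params / total_params


print(vision_tokens(1024))                   # 4096 patches -> 256 tokens
print(f"{active_fraction(3e9, 570e6):.0%}")  # 19%
```

Under these assumptions a 512×512 input yields 64 tokens, which matches the lower end of the 64~1853 range in the diagram.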

Five Resolution Modes

At inference time, you choose how many vision tokens to spend:

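| Mode | base_size | image_size | crop_mode | Use case |
| --- | --- | --- | --- | --- |
| Tiny | 512 | 512 | false | Simple pages, speed priority |
| Small | 640 | 640 | false | General documents |
| Base | 1024 | 1024 | false | Balanced option |
| Large | 1280 | 1280 | false | Detail-dense content |
| Gundam | 1024 | 640 | true | Multi-column, mixed tables (multi-tile slicing) |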

For complex layouts (academic papers, newspapers), you must enable Gundam for practical accuracy. In the paper’s newspaper tests, because each page had 4,000-5,000 text tokens, Gundam was required to keep the compression ratio below 10x.
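
To make the newspaper example concrete, here is a minimal sketch of choosing a mode from an estimated text-token count. The per-mode vision-token counts assume (size / 16)² patches followed by 16× compression, and the Gundam formula (n tiles of 100 tokens plus one 256-token global view, with 2 to 9 tiles) is likewise an assumption. `pick_mode` is a hypothetical helper, not part of any released API.

```python
# Vision tokens per single-view mode, assuming (size / 16)^2 patches
# followed by 16x compression. These values are assumptions.
SINGLE_VIEW_MODES = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}
MAX_RATIO = 10.0  # stay below 10x, where accuracy is reported at about 97%


def gundam_tokens(n_tiles: int) -> int:
    """Assumed Gundam cost: n 640px tiles plus one 1024px global view."""
    return n_tiles * 100 + 256


def pick_mode(text_tokens: int, max_tiles: int = 9) -> str:
    """Return the cheapest mode whose compression ratio stays below MAX_RATIO."""
    for name, tokens in SINGLE_VIEW_MODES.items():
        if text_tokens / tokens < MAX_RATIO:
            return name
    for n in range(2, max_tiles + 1):
        if text_tokens / gundam_tokens(n) < MAX_RATIO:
            return f"Gundam ({n} tiles)"
    raise ValueError("page too dense for the modeled modes")


print(pick_mode(800))   # Small: 800 / 100 = 8x
print(pick_mode(4500))  # Gundam (2 tiles): 4500 / 456 is about 9.9x
```

Under these assumptions, even Large mode (400 tokens) sits at 10x or above for a 4,000-5,000 token page, which is consistent with Gundam being required for newspapers.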

Two Comparison Groups on OmniDocBench

DeepSeek-OCR deliberately picked two extreme opponents:

| Model | Tokens / page | Result |
| --- | --- | --- |
| GOT-OCR 2.0 | 256 | baseline |
| DeepSeek-OCR | 100 | wins |
| MinerU 2.0 | 6000+ | baseline |
| DeepSeek-OCR | < 800 | wins |

The first group proves “token count can be even lower”; the second proves “even when the opponent uses 7-8x more tokens, I still win with fewer.” Production throughput: 200,000+ pages per day on a single A100-40G, clearly optimized for “generating training data for LLMs/VLMs.”
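
The figures in this section reduce to simple arithmetic. The sketch below converts the daily throughput to pages per second and computes the token ratios for both comparison groups. It treats the "6000+" and "< 800" bounds as exact values, which is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages_per_day = 200_000  # single A100-40G, the figure quoted in this section
pages_per_second = pages_per_day / SECONDS_PER_DAY

# Tokens per page from the comparison table; bounds treated as exact.
got_ocr_ratio = 256 / 100   # GOT-OCR 2.0 vs DeepSeek-OCR
mineru_ratio = 6000 / 800   # MinerU 2.0 vs DeepSeek-OCR

print(f"{pages_per_second:.2f} pages/s")             # 2.31 pages/s
print(f"{got_ocr_ratio:.2f}x, {mineru_ratio:.1f}x")  # 2.56x, 7.5x
```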

Contemporary Competitors: The Pure OCR Track

If all you need is PDF-to-Markdown conversion, 2025 actually offers more choices:

| Model | Params | olmOCR-Bench | Speed (pages/s) | Highlight |
| --- | --- | --- | --- | --- |
| LightOnOCR | 1B | 76.1 | 5.55 | Fastest, cheapest |
| DeepSeek-OCR | 3B (570M MoE) | 75.7 | 4.65 | Fewest vision tokens |
| PaddleOCR-VL | 0.9B | 80.0 | 2.20 | Most compact, multilingual |
| dots.ocr | 3B | 79.1 | 1.94 | 100+ languages incl. low-resource |
| olmOCR-2 | 7B | 82.4 | 1.78 | RLVR training, strong handwriting |
| Chandra-OCR | 8B | 83.1 | 1.29 | Best handwriting, multi-format output |

Looking at the OCR task alone, the most accurate are Chandra and olmOCR-2; the best speed-to-cost ratio goes to LightOnOCR. DeepSeek-OCR does not stand out on this dimension.
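
One way to check this reading is to compute the accuracy/speed Pareto frontier from the table, meaning the models that no other model beats on both axes at once. This is a minimal sketch using only the numbers listed above. Vision-token count, the axis where DeepSeek-OCR is strongest, is deliberately left out.

```python
# (model, olmOCR-Bench score, pages per second) copied from the table
MODELS = [
    ("LightOnOCR", 76.1, 5.55),
    ("DeepSeek-OCR", 75.7, 4.65),
    ("PaddleOCR-VL", 80.0, 2.20),
    ("dots.ocr", 79.1, 1.94),
    ("olmOCR-2", 82.4, 1.78),
    ("Chandra-OCR", 83.1, 1.29),
]


def dominates(a, b) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


def pareto_frontier(models):
    """Names of models that no other model beats on both score and speed."""
    return [m[0] for m in models if not any(dominates(o, m) for o in models)]


print(pareto_frontier(MODELS))
# ['LightOnOCR', 'PaddleOCR-VL', 'olmOCR-2', 'Chandra-OCR']
```

On these two axes alone, DeepSeek-OCR is dominated by LightOnOCR, which is faster and scores slightly higher.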

Its true uniqueness lies in achieving near-first-tier accuracy with the fewest vision tokens — and that is exactly the evidence the “optical compression” research direction needs.

The Real Peers: Research Direction

If you place DeepSeek-OCR back on the “visual context compression” research track, its true peers are only two or three:

Vist (Vision-centric Token Compression, NeurIPS 2025 spotlight)

  • Slow-fast dual pathway architecture, mimicking human “scanning + close reading”
  • Fast path: distant tokens rendered as images, processed by a frozen lightweight vision encoder for an overview
  • Slow path: nearby tokens kept as text, fed to the LLM for close reading
  • Result: at the same accuracy, tokens down 2.3x, FLOPs down 16%, memory down 50%
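
The slow-fast split can be pictured with a toy sketch: the most recent tokens stay as text, and everything earlier is handed to the image path. This is not Vist's actual implementation. `split_context`, `effective_length`, and the 4x compression factor on the far path are hypothetical choices for illustration.

```python
def split_context(tokens: list, near_window: int) -> tuple:
    """Split a context into distant tokens (fast path) and nearby tokens (slow path)."""
    if near_window <= 0:
        return list(tokens), []
    return tokens[:-near_window], tokens[-near_window:]


def effective_length(n_far: int, n_near: int, far_compression: float) -> float:
    """Tokens the LLM sees if the far segment is compressed by far_compression."""
    return n_near + n_far / far_compression


tokens = [f"tok{i}" for i in range(1000)]
far, near = split_context(tokens, near_window=200)

print(len(far), len(near))                          # 800 200
print(effective_length(len(far), len(near), 4.0))   # 400.0 (hypothetical 4x)
```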

Glyph (Tsinghua, October 2025, same period)

  • Same idea of “render text as images and feed to a VLM”
  • Discussed alongside DeepSeek-OCR during the same week

DeepSeek-OCR’s contribution on this track is pushing the compression ratio to its maximum (10x) and validating accuracy, while Vist provides a more engineering-oriented application framework (dual pathway rather than full image).

Implications for RAG and Long Context

At the end of the paper, DeepSeek hints at an interesting direction: memory decay mechanisms. Older conversations/context stored at higher compression ratios (blurrier images, fewer tokens), recent context stored at lower compression (clear) — simulating the brain’s memory decay.
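
A memory decay schedule of this kind could look like the sketch below, where each turn's vision-token budget shrinks with its age. The doubling schedule, the parameter names, and the example history are all hypothetical. The paper only hints at the idea, and the 20x cap is chosen because accuracy drops to roughly 60% at that ratio.

```python
def vision_token_budget(text_tokens: int, age: int,
                        base_ratio: float = 1.0,
                        growth: float = 2.0,
                        max_ratio: float = 20.0) -> int:
    """Vision tokens allotted to a context chunk that is `age` turns old.

    The compression ratio doubles with each turn of age (a hypothetical
    schedule) and is capped at max_ratio.
    """
    ratio = min(base_ratio * growth ** age, max_ratio)
    return max(1, round(text_tokens / ratio))


history = [1200, 800, 1000, 600]  # text tokens per turn, oldest first
budgets = [
    vision_token_budget(n, age=len(history) - 1 - i)
    for i, n in enumerate(history)
]
print(budgets)  # [150, 200, 500, 600]
```

The oldest turn is stored at 8x compression and the newest at full fidelity, so total context cost falls from 3,600 to 1,450 tokens in this toy example.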

For practitioners actually building RAG systems, here are some pragmatic trade-offs in the near term:

| Scenario | Recommendation |
| --- | --- |
| Pure markdown / plain text content | No OCR needed; chunking + embedding is still the optimal solution |
| Need to ingest external PDFs / slides / screenshots | PaddleOCR-VL 0.9B (compact) or Chandra (most accurate) |
| Want to experiment with “infinite context” | Follow Vist and DeepSeek-OCR’s subsequent work |
| Cloudflare Workers / edge environments | Can’t run locally; must use external GPU APIs (vLLM, SGLang already supported) |

DeepSeek-OCR released weights and methodology, not a product. It produced reproducible experimental evidence for a direction previously only discussed in papers (visual modality as compression channel), which is more significant than “yet another OCR SOTA.”

The Bottom Line

DeepSeek-OCR is a research experiment dressed up as OCR. It’s not the most accurate model at the OCR task, but it shows that, at compression ratios below 10x, one vision token can carry roughly the information of 10 text tokens at about 97% decoding accuracy. If you just want document parsing, choose Chandra or olmOCR-2; if you want to understand how LLM long context might evolve, the DeepSeek-OCR paper is worth reading from start to finish.

The paper explicitly states the next steps: digital-optical text interleaved pretraining, needle-in-a-haystack testing. If those experiments also succeed, the definition of LLM context windows will truly need to be rewritten.

References