🌏 中文版
Vector search architectures fall into three broad categories. Understanding the trade-offs between them is the key to picking the right tool.
Bi-Encoder (dual-tower): Query and document are each embedded independently into a single vector, then scored by cosine similarity. Fast and ANN-friendly, but because there is no interaction between query tokens and document tokens, precision has a ceiling.
Cross-Encoder (cross-attention): Query and document are fed into a Transformer together so every token can attend to every other token — highest precision of the three. The catch: each (query, doc) pair requires a separate forward pass, making it O(n) at query time. It cannot be used for large-scale retrieval; it is only practical as a reranker over a small candidate set.
ColBERT (Late Interaction): Sits between the two. Query and document are encoded separately, but every token’s vector is kept rather than collapsed into a single embedding. At scoring time, each query token finds its most similar document token (MaxSim), and the scores are summed.
MaxSim Scoring
Query tokens: [q1, q2, q3, q4] → 4 vectors
Document tokens: [d1, d2, d3, ..., d20] → 20 vectors
Score(query, doc) = Σᵢ max_j sim(qᵢ, dⱼ)
q1 compared against all document tokens → take the highest score
q2 compared against all document tokens → take the highest score
q3 compared against all document tokens → take the highest score
q4 compared against all document tokens → take the highest score
Total score = sum of the four highest scores
This design preserves fine-grained, token-level comparison (close to a Cross-Encoder) while allowing document vectors to be precomputed and stored offline — no need to re-run the Transformer at query time (much faster than a Cross-Encoder).
Comparison: Bi-Encoder vs. ColBERT vs. Cross-Encoder
| Bi-Encoder | ColBERT | Cross-Encoder | |
|---|---|---|---|
| Query vectors | 1 | N_q (per token) | N/A |
| Document vectors | 1 | N_d (per token) | N/A (joint computation) |
| Precomputable doc vectors | ✅ | ✅ | ❌ |
| Token interaction | ❌ | Partial (MaxSim) | ✅ (full attention) |
| Index size | Small | Large (N_d × bigger) | N/A |
| Search speed | Fast | Medium | Slow |
| Precision | Low | Medium–High | High |
The price ColBERT pays is index size: instead of one vector per document, you get one vector per token. A 200-token document produces 200 vectors in the ColBERT index. At scale, both storage cost and query latency grow significantly.
ColBERTv2 Improvements
The original ColBERT’s index was too large. ColBERTv2 addresses this with residual compression:
- Run k-means over all token vectors to find cluster centroids (typically 64 or 256)
- Store each vector as “nearest centroid + residual”
- Quantize the residual to 2 bits
The result is a 6–10× reduction in index size with minimal precision loss.
Where ColBERT Fits in a RAG Pipeline
ColBERT can slot into two different stages:
As a first-stage retriever (replacing Bi-Encoder): Better precision than a Bi-Encoder, but the larger index and slower speed make it best suited for corpora in the hundreds of thousands of documents or fewer.
As a second-stage reranker (replacing Cross-Encoder): Faster than a Cross-Encoder (precomputed doc vectors), with comparable precision. A good fit when reranking a few hundred candidates.
For a climbing community platform at the scale of a few thousand to tens of thousands of routes, using ColBERT as a reranker is reasonable — the index size stays manageable.
Practical Usage
The most mature implementation today is Stanford’s RAGatouille library:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index documents
RAG.index(
collection=documents,
index_name="climbing-routes",
)
# Search
results = RAG.search(query="intermediate routes at Longdong", k=10)
In a TypeScript / Cloudflare Workers environment, however, ColBERT support is still very limited. Using it would require running a separate Python service, adding meaningful architectural complexity.
Bottom Line
ColBERT occupies an interesting middle ground in vector search — elegant in theory. In practice, the index size problem and the immature tooling ecosystem (especially in TypeScript) make it less practical than a straightforward Bi-Encoder retrieval + Cross-Encoder reranking two-stage pipeline.
For most RAG systems, the established Bi-Encoder + Cross-Encoder combination remains the more mature choice. ColBERT is worth watching — especially as ColBERTv2’s compression brings index costs down to an acceptable range and more platforms start offering native support.
References
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020)
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (2021)
- NobodyClimb System Architecture: Full-Stack Climbing Community on Cloudflare (zh-TW only)
- NobodyClimb AI Architecture: 20-Node RAG Pipeline (zh-TW only)
Loading...