Vector search (Bi-Encoder) is fast and efficient, but it has a fundamental limitation: queries and documents are encoded independently, with no cross-attention between them.
A Bi-Encoder converts the query and each document into separate vectors, then compares them with cosine similarity. During this process, the query tokens never see the document content, and document tokens never see the query. This architecture works well for large-scale ANN (approximate nearest neighbor) search because document vectors can be computed once and indexed ahead of time, but a single vector per text is too coarse for fine-grained relevance scoring.
Cross-Encoders work differently: they feed the query and document together into a Transformer, letting them attend to each other, and output a relevance score that genuinely reflects “how well this document answers this query.”
Architecture Comparison
Bi-Encoder (vector search):
Query → [Encoder] → q_vector
Doc → [Encoder] → d_vector
Score = cosine(q_vector, d_vector)
Cross-Encoder (reranking):
[Query; Doc] → [Transformer] → relevance_score
Cross-Encoder computation is O(n) in the number of candidates: every query-document pair needs its own Transformer forward pass at query time, and nothing can be precomputed. That makes it unsuitable for searching across a large index. But once you’ve narrowed the field to a few dozen candidates, the compute is entirely manageable and the precision improvement is substantial.
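The Bi-Encoder side of the comparison can be sketched in a few lines. This is a minimal illustration, assuming embeddings are plain number arrays that were computed elsewhere; `EmbeddedDoc` and `rankByCosine` are hypothetical names, not part of this system.

```typescript
// A document whose embedding was computed ahead of time (hypothetical shape).
interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vector length mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force ranking: the query vector is compared against stored vectors only.
// The document text itself is never consulted at query time.
function rankByCosine(
  queryVector: number[],
  docs: EmbeddedDoc[],
  topK: number
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(queryVector, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A real index would replace the brute-force loop with an ANN structure, but the scoring function stays the same, which is why the query and document never get to interact.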
Two-Stage Architecture
This is the standard industry combination:
Phase 1: Recall (Bi-Encoder)
Full index → Top-100 candidates (fast)
Phase 2: Precision (Cross-Encoder)
Top-100 → Top-10 reranked (accurate)
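The two phases above can be sketched as one function. This is an illustrative outline only: `RecallFn`, `RerankFn`, and `twoStageSearch` are hypothetical names, and the recall and rerank implementations are injected rather than shown.

```typescript
interface Candidate {
  id: string;
  text: string;
}

interface ScoredCandidate extends Candidate {
  score: number;
}

// Phase 1: cheap retrieval over the full index (for example, vector search).
type RecallFn = (query: string, limit: number) => Promise<Candidate[]>;

// Phase 2: expensive pairwise scoring over a small candidate set.
type RerankFn = (query: string, candidates: Candidate[]) => Promise<ScoredCandidate[]>;

async function twoStageSearch(
  query: string,
  recall: RecallFn,
  rerank: RerankFn,
  recallK = 100,
  finalK = 10
): Promise<ScoredCandidate[]> {
  const candidates = await recall(query, recallK);
  if (candidates.length === 0) return [];

  const scored = await rerank(query, candidates);

  // Sort defensively: do not assume the reranker returns results in order.
  return [...scored].sort((a, b) => b.score - a.score).slice(0, finalK);
}
```

Keeping both stages behind function types makes it easy to swap the recall strategy or the reranker model without touching the orchestration.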
The actual configuration used in this system:
- Input: Candidates after RRF fusion (typically 20–30)
- Model: @cf/baai/bge-reranker-base
- Output: A relevance score per document (0.0 – 1.0)
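A call to the reranker might look like the sketch below. The request shape (`query` plus `contexts` as a list of `{ text }` objects) and the response shape (`response` as a list of `{ id, score }`, where `id` is the index of the context) are assumptions to verify against the Workers AI model page. `AiBinding`, `Doc`, and `rerankWithBge` are hypothetical names introduced for illustration.

```typescript
// Minimal structural type for the AI binding; only `run` is used here.
interface AiBinding {
  run(model: string, input: unknown): Promise<unknown>;
}

// ASSUMPTION: response shape of the reranker model; check the model docs.
interface RerankOutput {
  response: { id: number; score: number }[];
}

interface Doc {
  id: string;
  text: string;
}

interface ScoredDoc extends Doc {
  score: number;
}

async function rerankWithBge(
  ai: AiBinding,
  query: string,
  docs: Doc[]
): Promise<ScoredDoc[]> {
  // ASSUMPTION: request shape of the reranker model; check the model docs.
  const out = (await ai.run("@cf/baai/bge-reranker-base", {
    query,
    contexts: docs.map((d) => ({ text: d.text })),
  })) as RerankOutput;

  return out.response
    // Ignore indices that do not map back to an input document.
    .filter((r) => r.id >= 0 && r.id < docs.length)
    .map((r) => ({ ...docs[r.id], score: r.score }))
    .sort((a, b) => b.score - a.score);
}
```

Typing the binding structurally keeps the function testable with a stub object, so the mapping and sorting logic can be checked without a network call.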
Threshold Filtering
After reranking, rather than blindly taking Top-K, we first filter out low-relevance documents using a threshold:
const threshold = config.reranker_relevance_threshold ?? 0.5;
const minKeep = config.reranker_min_keep ?? 3;
const filtered = reranked.filter(doc => doc.score >= threshold);
// Safety net: if fewer than minKeep documents pass the threshold,
// keep the top minKeep instead (assumes reranked is sorted by score, descending)
const final = filtered.length >= minKeep
  ? filtered
  : reranked.slice(0, minKeep);
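The same logic can be packaged as a self-contained function for experimenting with different thresholds. This is a sketch: `RankedDoc`, `RerankerConfig`, and `filterByRelevance` are hypothetical names, and the explicit sort is added here so the fallback does not depend on the input order.

```typescript
interface RankedDoc {
  id: string;
  score: number;
}

interface RerankerConfig {
  reranker_relevance_threshold?: number;
  reranker_min_keep?: number;
}

function filterByRelevance(
  reranked: RankedDoc[],
  config: RerankerConfig = {}
): RankedDoc[] {
  const threshold = config.reranker_relevance_threshold ?? 0.5;
  const minKeep = config.reranker_min_keep ?? 3;

  // Sort a copy so the fallback slice really takes the best-scoring documents.
  const sorted = [...reranked].sort((a, b) => b.score - a.score);
  const filtered = sorted.filter((doc) => doc.score >= threshold);

  // Fall back to the top minKeep when too few documents clear the threshold.
  return filtered.length >= minKeep ? filtered : sorted.slice(0, minKeep);
}

// Only two documents clear 0.5, so the fallback keeps the top three.
const kept = filterByRelevance([
  { id: "a", score: 0.4 },
  { id: "b", score: 0.9 },
  { id: "c", score: 0.2 },
  { id: "d", score: 0.6 },
]);
console.log(kept.map((d) => d.id)); // [ 'b', 'd', 'a' ]
```

Note that when the fallback triggers, some returned documents sit below the threshold, so downstream steps should not assume every document is above it.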
min_keep (reranker_min_keep in the config) is an important safety design: if too few candidates clear the threshold and everything gets filtered out, the LLM has no context to work with and falls back to general knowledge, which is where hallucinations tend to appear. Keeping a minimum number of documents lets the downstream LLM-as-Judge decide whether to add a disclaimer to the response.
Skip Condition
Reranking is skipped when there is one candidate or none: there’s nothing to reorder, so we save an API call.
skipWhen: (ctx) => ctx.candidateMatches.length <= 1
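How a pipeline runner might honor `skipWhen` is sketched below. The article does not show the surrounding framework, so `PipelineContext`, `PipelineStep`, and `runStep` are hypothetical shapes chosen only to make the condition executable.

```typescript
interface PipelineContext {
  candidateMatches: { id: string; text: string }[];
}

interface PipelineStep {
  name: string;
  // Optional guard: when it returns true, the step is bypassed.
  skipWhen?: (ctx: PipelineContext) => boolean;
  run: (ctx: PipelineContext) => Promise<PipelineContext>;
}

async function runStep(
  step: PipelineStep,
  ctx: PipelineContext
): Promise<PipelineContext> {
  // Pass the context through unchanged when the guard fires.
  if (step.skipWhen?.(ctx)) return ctx;
  return step.run(ctx);
}

// Hypothetical rerank step using the same condition as the article.
const rerankStep: PipelineStep = {
  name: "rerank",
  skipWhen: (ctx) => ctx.candidateMatches.length <= 1,
  run: async (ctx) => ({
    // Placeholder for the real reranker call: reverses the order.
    candidateMatches: [...ctx.candidateMatches].reverse(),
  }),
};
```

Putting the guard on the step definition, rather than inside `run`, keeps the skip decision visible in configuration and avoids the model call entirely.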
Why BGE Reranker
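bge-reranker-base is a Cross-Encoder from BAAI, the same family as BGE-M3, which is also the embedding model in this system. Staying within one model family is likely to keep language coverage and relevance judgments broadly consistent between retrieval and reranking, although the reranker scores query-document pairs directly and does not operate in the embedding vector space. It’s also available as a first-party option on Cloudflare Workers AI.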
For higher precision requirements, you can switch to bge-reranker-large, but latency and cost will increase accordingly.
Impact on the Overall System
Reranking has the greatest impact on final output quality in the following scenarios:
Highest benefit:
- Multi-path retrieval (HyDE + Multi-Query + BM25) produces many candidates of uneven quality
- Complex query intent where simple cosine similarity ordering tends to drift off target
Lower benefit:
- Already few candidates (< 5)
- Simple queries with clear semantics where the first-round results are already decent
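The multi-path case is where the candidate pool is largest, and in this system the reranker receives candidates after RRF fusion. A minimal sketch of reciprocal rank fusion follows, assuming each retrieval path returns an ordered list of document ids. `reciprocalRankFusion` is a hypothetical name, and `k = 60` is the commonly used constant, not a value taken from this system.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document,
// where rank is the 1-based position of the document in that list.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60 // ASSUMPTION: conventional default, not this system's setting
): { id: string; score: number }[] {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }

  // Documents that appear near the top of several lists rise to the front.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Three retrieval paths (for example HyDE, Multi-Query, BM25) with overlapping results.
const fused = reciprocalRankFusion([
  ["a", "b", "c"],
  ["b", "a", "d"],
  ["b", "c"],
]);
console.log(fused.map((d) => d.id)); // [ 'b', 'a', 'c', 'd' ]
```

Fusion only combines rank positions, so it cannot tell a strong match from a weak one that happens to appear in several lists. That is the gap the Cross-Encoder pass is meant to close.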
Overall, reranking is the most direct lever for improving precision in a RAG pipeline, and the cost is well within reason: running cross-attention over 30 candidates is much cheaper than a single LLM generation pass.