Semantic Similarity ≠ Retrieval Relevance: Scenarios, Detection, and Remedies for Systematic Embedding Retrieval Failures

TL;DR Cosine similarity and relevance systematically diverge across an entire class of scenarios: negation (most IR models score at or below random on NevIR), exact identifiers, numeric thresholds, and logical combinations (SoTA models achieve recall@100 < 20 on LIMIT) -- some of these hit the theoretical ceiling of the single-vector paradigm, and switching to a larger model will not help. Recommended remedy order: hybrid BM25 -> reranker (Anthropic measured -67%) -> upstream metadata routing -> domain fine-tuning / multi-vector.

#retrieval #embedding #rag #vector-search #llm

Table of Contents

Why Cosine Similarity ≠ Relevance
The Ten Scenarios with the Largest Gap (Checklist)
Detection Without Ground Truth
Remedies: From Cheapest to Most Thorough
The Big Picture
References

🌏 中文版

The default assumption in dense retrieval is “more semantically similar = more relevant.” But similarity is the model’s geometric approximation of overall passage semantics, while relevance is defined by the user’s current task as “right or wrong” — the two align for most ordinary queries, yet systematically diverge across an entire class of scenarios. Worse, this is not entirely a matter of model quality: part of it is a theoretical ceiling of the single-vector paradigm, and switching to a larger embedding model or a different provider will not help. This article surveys the root mechanisms behind the gap, the ten scenarios most likely to cause problems, how to detect failures without ground truth, and a recommended remedy sequence from cheapest to most thorough.

Why Cosine Similarity ≠ Relevance

Dimensional ceiling (theory). Google DeepMind’s LIMIT paper (arXiv:2508.21038) proves that the number of top-k document “combinations” a single vector can represent is bounded by the embedding dimension d — given enough documents, there inevitably exist relevant combinations that vectors of that dimension mathematically cannot rank correctly, even if you freely optimize the vectors directly on the test set, they will collapse beyond a certain critical point. The empirical results are even worse: LIMIT’s tasks are extremely simple (combinatorial queries like “who likes apples?”), yet SoTA MTEB models achieve recall@100 below 20.

Granularity dilemma. WeChat AI’s CapRetrieval (arXiv:2506.08592) shows that encoders must simultaneously align overall passage semantics and highlight fine-grained entities — two objectives that pull against each other. Overall passage semantics may be correct, but details like “whether there is a cat in the image” get averaged out.

Cosine values themselves are unreliable. Steck et al. (Netflix, arXiv:2403.05440) argue that learned embeddings have rescaling degrees of freedom, so cosine values can be arbitrary and vary with regularization — they should not be treated as calibrated relevance scores (a single team’s theoretical argument, but widely cited in the industry).

Weak OOD generalization. The SPAR paper from Findings of EMNLP 2022 (arXiv:2110.06918) empirically shows that dense retrievers degrade when they leave the training distribution, while training-free BM25 remains more stable. Combined with the fact that embeddings are trained with contrastive learning and paraphrase objectives, they are inherently good at synonym rephrasing and inherently poor at lexical precision, negation, and numeric reasoning.

The Ten Scenarios with the Largest Gap (Checklist)

Ranked by “how likely embedding is to fail”:

Negation / antonymy: “sugar-free beverages” vs. “sugary beverages” produce nearly identical vectors. NevIR (EACL 2024, arXiv:2305.07614) shows that most IR models perform no better than random on negation pairs, and a subsequent reproduction (arXiv:2502.13506) confirms the result.
Exact identifiers: Contract numbers, SKUs, invoice numbers — semantically “all just strings of characters,” but the user needs an exact match.
Company-specific abbreviations / jargon: Internal acronyms not seen in pretraining corpora; the embedding cannot map them to their expanded forms.
Numeric magnitude / thresholds: “Amount > 1 million.” Embeddings do not represent orders of magnitude or comparisons.
Temporal predicates: “Latest version,” “filed after 2020.” Embeddings do not encode temporal ordering.
Polysemy: pool (swimming pool / reinsurance pool), claim — the domain-specific meaning may rank below random sentences.
Fine-grained entities / events: Overall passage semantics match, but details do not (the focus of CapRetrieval).
Combinatorial / logical relevance: “Satisfies A and not B” — the dimensional ceiling proven by LIMIT is exactly where this breaks down.
Multi-hop reasoning / aggregation counting: “Who is the supervisor of the person who signed the contract?” — a single similarity lookup cannot answer this; it requires decomposition or SQL.
Cross-lingual: Mixed Chinese-English queries require multilingual models.

Common signal: The gap is largest when the query is short, demands precision, and relevance is defined by “task rules” rather than “semantic resemblance.”

Detection Without Ground Truth

In production there is no gold standard answer; rely on proxy signals:

Retriever vs. cross-encoder divergence: Retrieve top-k with a cheap bi-encoder, then rescore with a cross-encoder — if the two rankings severely disagree, the embedding recall is untrustworthy. Note that reranker scores are suitable for ranking but are not necessarily good absolute relevance scores.
LLM-as-judge context relevance: The first leg of the RAG triad (context relevance / groundedness / answer relevance) — have an LLM judge per-chunk whether “the retrieved chunk is relevant to the query,” sampled at runtime.
Score distribution signals: If the top-1 score is absolutely low, or top-1 through top-k scores are tightly clustered (small margin), it indicates low confidence and possibly an entirely irrelevant batch (industry practice, no public benchmark; use as an auxiliary signal).
Lexical cross-check: Run BM25 on the same query simultaneously — if documents with high lexical overlap retrieved by BM25 are absent from the dense top-k, exact-match / identifier-type recalls are likely being missed.
Offline adversarial probes: Periodically stress-test your retriever with “minimal adversarial units” from LIMIT, NevIR, and CapRetrieval to quantify degradation on negation, combinations, and fine-grained scenarios.

Remedies: From Cheapest to Most Thorough

Hybrid (BM25 + dense fusion): Recovers lexical exact matches and identifier lookups. Anthropic’s Contextual Retrieval uses contextual embedding + contextual BM25, reducing retrieval failure rate by 49%. See the site’s hybrid search and Contextual Retrieval articles.
Add a reranker (cross-encoder): In the same Anthropic experiment, stacking reranking further reduced the failure rate by 67% (top-20). Cross-encoders are also named in the LIMIT paper as one method that bypasses the single-vector dimensional ceiling. See: cross-encoder reranking.
Multi-vector (ColBERT-style late interaction): Greater expressiveness than single-vector; performs far better on LIMIT. See: ColBERT late interaction.
Query rewriting / decomposition / HyDE: In multi-turn conversations, first rewrite into a self-contained query; for multi-hop questions, decompose first. See: HyDE.
Upstream structured filtering (underrated but often most effective): Use time, numeric values, and categories as metadata filters; route proper nouns and identifiers to keyword exact matching — do not feed them to vectors. In enterprise scenarios, the primary source of reliability is strong upstream filtering, not stacking rerankers on top of weak recall.
In-domain fine-tuning of embeddings: For OOD and fine-grained issues, CapRetrieval achieved the best results using a data generation strategy for fine-tuning.
Router: Route by query type — identifiers -> keyword search, numeric/temporal -> SQL / filter, semantic Q&A -> dense + rerank, aggregation counting -> aggregation pipeline. See: query classification and adaptive routing.

The Big Picture

One-sentence version: The gap is largest when “relevance is defined by rules rather than semantics” — negation, exact identifiers, numeric/temporal values, logical combinations — and combinatorial queries hit a theoretical ceiling that switching to a larger model cannot solve. Detection relies on a three-part toolkit: cross-encoder divergence, LLM-judge context relevance, and adversarial probes; the remedy sequence is hybrid -> rerank -> upstream metadata routing -> fine-tuning / multi-vector.

Design implication: Do not treat embedding as a universal retriever. It is an expert at “how semantically similar are these,” not an arbiter of “is this correct” — hand rule-based relevance to rule-based tools (BM25, filters, SQL), and let vectors do only what they do best.