Skip to content

RAG Multi-Entity Queries: When the User Lists Five Routes and the System Only Sees the First

Mar 28, 2026 1 min
TL;DR The RAG system's extractRouteReference() used a for...return pattern that grabbed only the first match — so when a user provided five completed routes, only one was used. The fix evolves through three layers: rule-based multi-entity extraction, user profile aggregation, and embedding centroid.

🌏 中文版

The user says: “I recently completed Baihu 5.11d, Lightning 5.12a, Looks Like I Can 5.11c, Bubblegum 5.11b, and Hsinchu Hakka 5.10d — recommend 3 routes I haven’t climbed.” The system sees Baihu and returns. The other four routes? Gone.

This is a real bug that surfaced in the NobodyClimb climbing recommendation system. This post documents the root cause, how academia and industry handle similar problems, and the layered solution we ultimately chose.

The Problem: The Cost of for…return

The logic in extractRouteReference() was simple: iterate over known route names, match against the query string, and return on the first hit. This works perfectly for single-entity queries like “recommend routes similar to Baihu.” But when a user provides five routes at once, three kinds of information evaporate immediately:

Difficulty range loss — The five routes span 5.10d to 5.12a, meaning the user’s ability covers three major grade levels. By grabbing only Baihu at 5.11d, the recommendation range gets pinned to 5.11a–5.12b (±3 steps), completely ignoring the fact that the user has already sent 5.12a.

Style preference loss — Multiple routes may be spread across different crags and types (sport / trad / boulder), and that distribution reflects climbing preferences. A single route cannot represent it.

Incomplete exclusion list — The user said “routes I haven’t climbed,” but excludeRouteId only excludes Baihu. The other four routes can still appear in the recommendation results.

In IR terminology, this is a complex information need. Metzler & Croft (2005) noted long ago that most retrieval systems assume queries are atomic, but in practice users frequently embed multiple entities and implicit preferences in a single query. In the recommendation context, this is a variant of the cold-start problem: the user has proactively provided rich preference signals, and the system is only consuming a fraction of them.

Solution Landscape

After surveying the literature, five approaches emerged (pun intended).

Multi-Entity Extraction

The most intuitive fix: change extractRouteReference() to extractRouteReferences() and return an array instead of a single result.

Rule-based approach: Replace for...return with for...push, adding a consumed-range mechanism to prevent overlapping matches (once “Looks Like I Can” is matched, “I Can” cannot match again). Aggregate the difficulty of all routes into a union range:

gradeFilter = {
  $gte: min(allGrades) - margin,
  $lte: max(allGrades) + margin
}

LLM-based approach: Have the LLM extract all routes at once via structured output:

{
  "routes": [
    {"name": "Baihu", "grade": "5.11d"},
    {"name": "Lightning", "grade": "5.12a"},
    {"name": "Looks Like I Can", "grade": "5.11c"},
    {"name": "Bubblegum", "grade": "5.11b"},
    {"name": "Hsinchu Hakka", "grade": "5.10d"}
  ],
  "intent": "recommend_next_challenge",
  "exclude_mentioned": true
}

On the academic side, Li et al. (2020)‘s FLAT (Flat-Lattice Transformer) achieved SOTA on Chinese NER and handles multiple overlapping entities within a single sentence. Yan et al. (2021) reframed NER as a reading comprehension task, which naturally supports multi-entity extraction. In industry, Amazon Alexa’s Multi-slot NER and Rasa NLU’s CRF + Transformer pipeline solve the same problem.

Query Decomposition

Decompose a complex query into multiple sub-queries, retrieve independently, then merge. For this case:

  1. Profile Sub-query: Extract all five routes → build a user capability profile
  2. Exclusion Sub-query: Exclude all five routes → build an exclusion list
  3. Recommendation Sub-query: Use the profile as a basis → execute recommendation retrieval

There are several relevant frameworks. Self-Ask (Press et al., 2023) has the LLM ask itself follow-up questions, breaking a complex question into independently answerable sub-questions. IRCoT (Trivedi et al., 2023) goes further by interleaving Chain-of-Thought reasoning with retrieval — each reasoning step generates new retrieval needs. LangChain’s Multi-Query Retriever and LlamaIndex’s Sub-Question Query Engine follow the same paradigm.

User Profile Aggregation

Rather than treating each route as a separate retrieval anchor, aggregate them into a user preference profile:

DimensionAggregation MethodResult for This Case
Ability ceilingmax grade5.12a → recommend 5.12a–5.12c
Comfort zonemedian grade5.11c
Route typesport/trad/boulder ratiosport-leaning → weighted
Cragcrag distributionmulti-crag → no restriction
Exclusioncollect all route_idsall 5 excluded

At the embedding level, compute a centroid:

query_vector = mean([embed(Baihu), embed(Lightning), embed(Looks Like I Can), embed(Bubblegum), embed(Hsinchu Hakka)])

This is the classic average pooling of item embeddings from recommendation systems. The YouTube Recommendations paper (Covington et al., 2016) uses the average of watch-history embeddings as the user representation — the exact same concept.

Plan-and-Execute RAG

Introduce a planning phase that analyzes the complete intent before executing step by step:

Query → Planner → [Step 1: Extract all routes]
                   [Step 2: Build user profile]
                   [Step 3: Determine search criteria]
                   [Step 4: Vector search with aggregated filter]
                   [Step 5: Re-rank and exclude]
               → Executor → Response

This connects to the existing LangGraph architecture’s multi_tool path, but the key difference is that the planning phase in Plan-and-Execute is more structured (it doesn’t just decide which tools to use — it also decides how to aggregate intermediate results), and the execution phase includes a feedback loop. Wang et al. (2023)‘s Plan-and-Solve Prompting and Yao et al. (2023)‘s ReAct point in this direction.

Collaborative Filtering

Look at “what other users who climbed the same routes also climbed”:

-- Find users with similar taste
SELECT user_id, COUNT(*) as overlap
FROM ascents
WHERE route_id IN ('baihu_id', 'lightning_id', 'looks_like_i_can_id', 'bubblegum_id', 'hsinchu_hakka_id')
  AND user_id != current_user_id
GROUP BY user_id
ORDER BY overlap DESC
LIMIT 10;

-- Find recommendations from similar users
SELECT route_id, COUNT(*) as popularity
FROM ascents
WHERE user_id IN (top_10_similar_users)
  AND route_id NOT IN (excluded_routes)
GROUP BY route_id
ORDER BY popularity DESC
LIMIT 3;

The concept is elegant, but the constraints are practical: it requires sufficient user records, the computation is too heavy for Cloudflare Workers’ CPU time limits, and it’s better suited to offline precomputation rather than real-time queries.

Multi-Hop QA: The Academic Connection

Multi-entity queries are closely related to Multi-Hop QA. Key papers:

  • HotpotQA (Yang et al., 2018) — Multi-hop QA benchmark requiring reasoning across multiple documents
  • MDR (Xiong et al., 2021) — Iterative retrieval that adjusts each hop’s query based on the previous hop’s results
  • Baleen (Khattab et al., 2021) — Condensed retrieval that compresses intermediate results across hops for efficiency
  • DSP (Khattab et al., 2023) — Precursor to DSPy; demonstrate-search-predict pipeline

For Query Decomposition, Ma et al. (2023)‘s Query Rewriting and Shao et al. (2023)‘s ITER-RETGEN (iterative retrieval-generation) are worth studying. For multi-signal fusion in recommendation systems, SASRec (Kang & McAuley, 2018) uses self-attention over interaction sequences, and Li et al. (2023)‘s GPT4Rec frames recommendation directly as a language task.

Chosen Approach: Layered Implementation

Given the constraints of the Cloudflare Workers runtime, D1 SQLite, and the existing LangGraph architecture, a single big-bang rewrite wasn’t feasible. The approach breaks into three layers:

P0 (immediate): Multi-Entity Extraction

Smallest change, biggest impact. extractRouteReference()extractRouteReferences(), returning an array. The caller aggregates filters across multiple routes, and excludeRouteId becomes excludeRouteIds. No additional LLM calls, no latency increase. Estimated to resolve 80% of multi-entity query problems.

P1 (short-term): User Profile Aggregation

Add buildUserProfile() to compute the ability ceiling, comfort zone, preferred type, and crag distribution from multiple routes. The profile is injected into the LLM prompt’s system context, giving the model complete user background when generating responses.

P2 (medium-term): Embedding Centroid

If multiple routes have embeddings, compute the centroid vector as the query vector. Results are then re-ranked: exclude mentioned routes, apply difficulty-appropriateness weighting, and apply diversity weighting. This depends on whether the Cloudflare Vectorize API supports custom query vectors.

Integration points with the existing architecture:

nlp.ts            → extractRouteReferences() returns array
tool-selection.ts → routeRefs: RouteReference[], aggregated filter
GraphState        → excludeRouteIds: string[]
filter-build.ts   → handles multiple crags ($in instead of $eq)
vector search     → post-filter excludes multiple routes ($nin instead of $ne)

In Summary

The core trade-off is “maximize recommendation quality improvement with minimal additional computation, within edge runtime constraints.” P0’s rule-based multi-entity extraction is nearly zero-cost and resolves the majority of problems — the best return on investment. P1 adds profile aggregation so the LLM has fuller context, at the cost of a bit more string processing. P2’s embedding centroid is where computation and API limits actually come into play, so feasibility validation is needed first.

Query Decomposition and Collaborative Filtering are deferred for now. The former overlaps with the existing LangGraph multi-tool path; the latter needs more user data. Once P0–P2 are live, the next step depends on where the recommendation quality bottleneck actually sits.


References