Skip to content

LongRAG: Rethinking RAG Chunking Strategy with Long-Context Models

Mar 15, 2026 1 min
TL;DR Traditional RAG splits documents into small chunks for retrieval, but this causes information fragmentation. LongRAG leverages 100K+ token long-context models to retrieve larger document segments (entire sections or even whole documents), reducing fragmentation while maintaining retrieval efficiency.

🌏 中文版

The core assumption of traditional RAG is that LLM context windows are limited, so we must split documents into small pieces and only feed the most relevant fragments into the model.

This assumption was reasonable before 2024. But now Gemini has 1M tokens, Claude has 200K tokens, and GPT-4o has 128K tokens. When context windows explode from 4K to over 100K, RAG’s design logic should evolve accordingly.

LongRAG is the embodiment of this thinking: Stop splitting documents into fragments — retrieve larger units and let long-context models do the understanding.


The Problem with Small Chunks

Traditional RAG typically splits documents into small chunks of 256–512 tokens. This approach seems reasonable, but in practice it creates a series of problems.

Information Fragmentation

A complete argument gets split across multiple chunks, with each chunk containing only partial information.

Consider this example:

Original paragraph:
"Article 12 of the contract stipulates that Party A must complete payment
within 30 business days of receiving the acceptance report. If Party A
delays payment, a penalty of 0.05% per day shall be applied. However,
if the delay is caused by Party B's failure to provide complete acceptance
documents, Party A bears no liability for the delay."

Split with a 128-token chunk size:

Chunk 1: "Article 12 of the contract stipulates that Party A must complete payment within 30 business days of receiving the acceptance report."
Chunk 2: "If Party A delays payment, a penalty of 0.05% per day shall be applied."
Chunk 3: "However, if the delay is caused by Party B's failure to provide complete acceptance documents, Party A bears no liability for the delay."

The user asks: “How much penalty does Party A pay for late payment?”

The retrieval system might only find Chunk 2. But the correct answer requires Chunk 2 + Chunk 3 — because there’s an exception clause. If only Chunk 2 is retrieved, the LLM will produce an answer that appears correct but is incomplete.

This is information fragmentation: each chunk’s semantics are incomplete and must be combined with other chunks to reconstruct the full meaning.

Boundary Context Loss

Chunk boundaries are often arbitrary. A chain of reasoning that crosses a boundary gets broken:

End of Chunk A: "...therefore, we adopted the Transformer architecture. Specifically,"
Start of Chunk B: "we used a 6-layer encoder with Rotary Position Encoding (RoPE),"

Chunk B lacks the context of “why we adopted Transformer.” If the user asks “Why was this architecture chosen?”, Chunk B alone cannot answer the question.

Over-reliance on Retrieval Precision

Small chunks mean a large number of candidate fragments. A 100,000-character document split with 512-token chunks produces roughly 400 chunks. The retrieval system must precisely find the 3–5 most relevant ones from 400 candidates.

This places extremely high demands on retrieval:

  • Embeddings must accurately capture the semantics of each small chunk
  • Ranking must be precise, because the difference between top-3 and top-10 could mean the difference between “has the answer” and “doesn’t have the answer”
  • Multi-hop reasoning (answers scattered across multiple chunks) is nearly impossible to handle well

Small chunks transfer all the “understanding” pressure onto “retrieval.” And retrieval is never perfect.

Uneven Semantic Density

Different paragraphs have vastly different semantic densities. A 512-token legal clause might contain 5 important points, while a 512-token background introduction might contain only 1. Fixed-size chunks cannot reflect this variation.


The Opportunity with Long-Context Models

Between 2024 and 2025, mainstream LLMs experienced explosive growth in context window sizes:

ModelContext WindowRelease Date
GPT-3.54K tokens2023-03
Claude 2100K tokens2023-07
GPT-4 Turbo128K tokens2023-11
Gemini 1.5 Pro1M tokens2024-02
Claude 3.5 Sonnet200K tokens2024-06
Gemini 2.01M tokens2025-01

This changes the fundamental tradeoff in RAG:

Before: Context windows are scarce resources → must retrieve precisely → small chunks → high retrieval pressure.

Now: Context windows are abundant → can include more content → large chunks → shift pressure from retrieval to comprehension.

Key Insight

Long-context models excel at finding relevant information within large amounts of text. Research shows that even within a 100K-token context, good models can accurately locate specific facts embedded within it (needle-in-a-haystack tests).

This means: We don’t need perfect retrieval — just good-enough retrieval. Throw roughly relevant content at the LLM and let it find the answer itself.

From Precise Retrieval to Coarse-Grained Retrieval

Traditional RAG mindset: “Only give the LLM the most relevant fragments; don’t waste tokens.”

LongRAG mindset: “Give the LLM enough context and let it decide what’s relevant.”

This isn’t a regression — it’s leveraging advances in model capability. Rather than investing heavily in perfecting retrieval (better embeddings, more precise reranking, smarter query expansion), leverage the model’s own comprehension ability.


LongRAG Architecture

LongRAG’s core design is simple: increase retrieval granularity.

Traditional RAG vs LongRAG

Traditional RAG:
┌──────────┐     ┌──────────────────────────────────────┐
│          │     │  Document A                           │
│  Query   │     │  ┌─────┐┌─────┐┌─────┐┌─────┐...    │
│          │     │  │ c1  ││ c2  ││ c3  ││ c4  │       │
│          │     │  │512t ││512t ││512t ││512t │       │
│          │     │  └─────┘└─────┘└─────┘└─────┘       │
└────┬─────┘     └──────────────────────────────────────┘

     │ Vector search              From hundreds of chunks
     │ (top-k=5)                  precisely find 5

┌─────┐┌─────┐┌─────┐┌─────┐┌─────┐
│ c2  ││ c17 ││ c43 ││ c8  ││ c91 │   ← May miss key fragments
└─────┘└─────┘└─────┘└─────┘└─────┘

     ▼  Feed to LLM (~2,500 tokens)
┌──────────────────────────────┐
│  LLM answers from fragmented │
│  context                     │
└──────────────────────────────┘


LongRAG:
┌──────────┐     ┌──────────────────────────────────────┐
│          │     │  Document A                           │
│  Query   │     │  ┌──────────────┐┌──────────────┐    │
│          │     │  │  Section 1   ││  Section 2   │    │
│          │     │  │  (~6,000t)   ││  (~8,000t)   │    │
│          │     │  └──────────────┘└──────────────┘    │
└────┬─────┘     └──────────────────────────────────────┘

     │ Vector search              From a small number of
     │ (top-k=3)                  segments, find roughly
     ▼                            relevant ones
┌──────────────┐┌──────────────┐┌──────────────┐
│  Section 2   ││  Section 5   ││  Section 11  │  ← Complete semantic units
└──────────────┘└──────────────┘└──────────────┘

     ▼  Feed to LLM (~20,000 tokens)
┌──────────────────────────────┐
│  LLM answers from complete   │
│  context                     │
└──────────────────────────────┘

The fundamental differences:

  • Traditional RAG: Hundreds of small chunks → precise retrieval → fragmented context → LLM pieces together an answer
  • LongRAG: Dozens of large segments → coarse-grained retrieval → complete context → LLM directly comprehends

Choosing Retrieval Units

LongRAG’s “large chunks” aren’t just small chunks stitched together arbitrarily. It uses the document’s own structure as splitting boundaries:

Retrieval UnitSize RangeUse Case
Paragraph groups1,000–3,000 tokensDocuments with unclear structure
Sections3,000–10,000 tokensDocuments with heading structure
Sub-documents10,000–30,000 tokensIndependent chapters in long documents
Entire documents30,000–100,000 tokensShort to medium-length standalone documents

Key principle: Splitting boundaries should align with semantic boundaries, not fixed token counts.


Redesigning the Chunking Strategy

LongRAG isn’t about “not chunking” — it’s about “chunking smarter.” Here are three main strategies.

Strategy 1: Document-Level Retrieval

The most extreme approach: each document is a single retrieval unit.

Index structure:
Document A → one embedding (representing the entire document's semantics)
Document B → one embedding
Document C → one embedding
...

Retrieval: find the 1–3 most relevant documents, feed full text to LLM.

Pros:

  • No chunking needed at all
  • Zero information fragmentation
  • Simplest implementation

Cons:

  • A single document may exceed the context window
  • One embedding struggles to represent all topics in a long document
  • Lowest retrieval precision (document-level is too coarse)

Best for:

  • Document collections where each document is relatively short (< 30K tokens)
  • Documents with a single topic that don’t cover multiple unrelated subjects
  • Small document collections (< 1,000 documents)

Strategy 2: Section-Level Retrieval

Split by the document’s section structure, with each section as a retrieval unit.

Index structure:
Document A / Section 1 → one embedding
Document A / Section 2 → one embedding
Document A / Section 3 → one embedding
Document B / Section 1 → one embedding
...

Retrieval: find the 2–5 most relevant sections, combine and feed to LLM.

Pros:

  • Preserves semantic completeness (sections are typically complete units of discourse)
  • Moderate retrieval granularity, more precise than document-level
  • Leverages the document’s own structure — no manual judgment needed for split points

Cons:

  • Requires documents with clear section structure (headings, table of contents, etc.)
  • Section sizes can vary dramatically (some 500 tokens, others 20,000 tokens)
  • Cross-section references are still lost

Best for:

  • Technical documentation, academic papers, legal documents
  • Documents with Markdown/HTML heading structure
  • Documents with multiple independently understandable topics

Strategy 3: Hybrid Strategy (Coarse Retrieval + Fine Reading)

This is LongRAG’s recommended approach: a two-stage architecture.

Stage 1 - Coarse-grained retrieval:
  Use section/document-level embeddings to find roughly relevant segments

Stage 2 - Fine-grained reading:
  LLM finds precise answers within the large segments

Equivalent to:
  Traditional RAG's "retriever + reader"
  But the retriever is coarser, and the reader is stronger

Process:

  1. During indexing, split at the section level (3,000–10,000 tokens)
  2. During retrieval, take top-k=3–5 sections (roughly 15,000–50,000 tokens total)
  3. Combine these sections into a single large context and feed to a long-context LLM
  4. The LLM finds precise answers within the large context

Pros:

  • Combines high recall from coarse retrieval with precise comprehension from the LLM
  • Doesn’t require perfect retrieval (the LLM filters out irrelevant content for you)
  • Highly adaptable — different documents can use different granularities

Strategy Comparison

DimensionDocument-LevelSection-LevelHybrid Strategy
Splitting granularityEntire document3K–10K tokens3K–10K tokens
Index sizeSmallestMediumMedium
Retrieval precisionLowMediumMedium + LLM compensation
Context completenessHighestHighHigh
Token consumptionHighestMedium-highMedium-high
Implementation complexityLowestMediumMedium-high
Required model1M context100K+ context100K+ context

Retrieval Efficiency Considerations

Large chunks change the performance characteristics of the retrieval system.

Dramatically Fewer Candidates

For the same corpus:

Traditional RAG (512 tokens/chunk):
  100K-character document × 100 documents = ~80,000 chunks
  Vector search space: 80,000 vectors

LongRAG (section-level, ~5,000 tokens/section):
  100K-character document × 100 documents = ~8,000 sections
  Vector search space: 8,000 vectors

A 10x reduction in candidates directly provides:

  • Faster vector search: ANN algorithms like HNSW are faster on smaller indexes
  • Lower storage costs: Fewer vectors = less memory and disk space
  • Simpler index maintenance: Fewer vectors to update when adding/removing documents

Precision vs. Recall Tradeoff

Large chunks naturally have higher recall (each chunk covers more content), but precision may decrease (more irrelevant content within each chunk).

Small chunks (512 tokens):
  ✓ High precision — each chunk is topic-focused
  ✗ Low recall — answer may be split across adjacent chunks
  ✗ Requires more precise retrieval

Large chunks (5,000 tokens):
  ✗ Lower precision — chunks may contain irrelevant content
  ✓ High recall — answer is more likely to be within a selected chunk
  ✓ High fault tolerance — can find answers even with imperfect retrieval

Balancing Strategies

Several methods to reduce precision loss with large chunks:

1. Multi-level Indexing

Maintain both section-level and paragraph-level embeddings simultaneously. Use section-level retrieval to narrow the scope first, then use paragraph-level for ranking.

2. Summary Embeddings

Instead of embedding the entire section text, first use an LLM to generate a section summary, then embed the summary. Summaries are more condensed with higher semantic density, resulting in better embedding quality.

3. Multi-vector Representation

Generate multiple embeddings for a single section (e.g., section summary + first sentence of each paragraph within the section). A hit on any of them counts as that section being relevant.

4. Post-retrieval Reranking

After retrieving large chunks, use a cross-encoder or LLM to perform secondary ranking on paragraphs within the chunk, prioritizing the most relevant portions.


Comparison with Traditional RAG

Here’s a detailed comparison between LongRAG and traditional RAG across multiple dimensions:

DimensionTraditional RAGLongRAG
Chunk size256–512 tokens3,000–100,000 tokens
Vectors in indexMany (hundreds per document)Few (single digits to dozens per document)
Retrieval precisionHigh (each chunk is topic-focused)Medium (mixed topics within chunks)
Context coherenceLow (fragmented)High (complete semantic units)
Token consumption/queryLow (~2K–5K tokens)High (~15K–50K tokens)
Inference latencyLowerHigher (more tokens to process)
Retrieval latencyHigher (large index)Lower (small index)
Multi-hop reasoningWeak (requires multiple retrievals)Strong (large context naturally covers it)
Dependence on retrieval qualityExtremely highModerate
LLM requirementsAny modelRequires long-context model
Cost per queryLowerHigher
Build complexityHigh (needs fine-grained chunking + reranking)Low (coarse-grained splitting suffices)

Concrete Performance Data

Based on the LongRAG paper and related research results:

NQ (Natural Questions) dataset:

  • Traditional RAG (100-word chunks, top-5): Answer hit rate ~52%
  • LongRAG (4K+ token groups, top-4): Answer hit rate ~71%
  • Improvement: ~19 percentage points

The key reason is improved recall. Traditional RAG’s top-5 small chunks frequently miss the fragment containing the answer, while LongRAG’s large chunks are more likely to encompass it.

Cost Analysis

LongRAG trades more tokens for better answer quality:

Traditional RAG per query:
  Retrieval: 5 chunks × 512 tokens = 2,560 input tokens
  Generation: ~200 output tokens
  Cost (Claude 3.5): ~$0.008

LongRAG per query:
  Retrieval: 3 sections × 6,000 tokens = 18,000 input tokens
  Generation: ~200 output tokens
  Cost (Claude 3.5): ~$0.055

Cost increase of approximately 7x, but with significantly improved answer quality.

Whether this tradeoff is worthwhile depends on the use case. For high-value queries in legal, medical, or financial domains, paying 7x more for more accurate and complete answers is reasonable. For low-value everyday Q&A, traditional RAG may be more economical.


Implementation Guide

Here’s a complete LongRAG retrieval pipeline implementation in TypeScript.

Section-Level Splitter

interface Section {
  id: string;
  documentId: string;
  title: string;
  content: string;
  tokenCount: number;
  embedding: number[];
  metadata: { level: number; position: number };
}

/**
 * Split documents by section structure, not fixed token counts.
 * Sections exceeding maxTokens are recursively split at the next heading level;
 * adjacent sections that are too small are automatically merged.
 */
function splitBySection(
  document: { id: string; content: string },
  maxTokens = 8000,
  minTokens = 500,
): Section[] {
  const lines = document.content.split('\n');
  const sections: Section[] = [];
  let currentTitle = '';
  let currentContent: string[] = [];
  let level = 1;
  let idx = 0;

  function flush() {
    if (!currentContent.length) return;
    const content = currentContent.join('\n');
    const tokens = estimateTokens(content);
    sections.push({
      id: `${document.id}_s${idx}`,
      documentId: document.id,
      title: currentTitle,
      content,
      tokenCount: tokens,
      embedding: [],
      metadata: { level, position: idx++ },
    });
    currentContent = [];
  }

  for (const line of lines) {
    const m = line.match(/^(#{1,6})\s+(.+)/);
    if (m) {
      flush();
      level = m[1].length;
      currentTitle = m[2].trim();
      currentContent = [line];
    } else {
      currentContent.push(line);
    }
  }
  flush();

  // Merge adjacent sections that are too small
  return sections.reduce<Section[]>((merged, section) => {
    const prev = merged[merged.length - 1];
    if (prev && prev.tokenCount < minTokens) {
      prev.content += '\n\n' + section.content;
      prev.tokenCount += section.tokenCount;
    } else {
      merged.push({ ...section });
    }
    return merged;
  }, []);
}

function estimateTokens(text: string): number {
  const zh = (text.match(/[一-鿿]/g) || []).length;
  const en = text.replace(/[一-鿿]/g, '').split(/\s+/).length;
  return Math.ceil(zh * 1.5 + en * 1.3);
}

LongRAG Retrieval Pipeline

import { cosineSimilarity, generateEmbedding } from './utils';

interface RetrievalResult {
  sections: Section[];
  totalTokens: number;
  query: string;
}

interface LongRAGConfig {
  topK: number;            // How many sections to retrieve
  maxContextTokens: number; // Maximum context tokens
  minRelevanceScore: number; // Minimum relevance score
}

/**
 * Core retrieval logic for LongRAG.
 * Key differences from traditional RAG:
 * 1. Retrieval units are sections (thousands of tokens), not small chunks (hundreds of tokens)
 * 2. top-k is smaller (3-5), because each result is already large
 * 3. Has token budget control to avoid exceeding the LLM's context window
 */
async function retrieveSections(
  query: string,
  index: Section[],  // Each section already has an embedding
  config: LongRAGConfig = { topK: 5, maxContextTokens: 50000, minRelevanceScore: 0.3 },
): Promise<RetrievalResult> {
  const queryEmbedding = await generateEmbedding(query);

  // Compute similarity → filter → sort
  const scored = index
    .map((section) => ({
      section,
      score: cosineSimilarity(queryEmbedding, section.embedding),
    }))
    .filter((item) => item.score >= config.minRelevanceScore)
    .sort((a, b) => b.score - a.score);

  // Select top-k sections within token budget
  const selected: Section[] = [];
  let totalTokens = 0;

  for (const item of scored) {
    if (selected.length >= config.topK) break;
    if (totalTokens + item.section.tokenCount > config.maxContextTokens) continue;
    selected.push(item.section);
    totalTokens += item.section.tokenCount;
  }

  // Sort by original position (preserve reading order)
  selected.sort((a, b) => {
    if (a.documentId !== b.documentId) return a.documentId.localeCompare(b.documentId);
    return a.metadata.position - b.metadata.position;
  });

  return { sections: selected, totalTokens, query };
}

Complete Pipeline: Retrieval + Generation

/**
 * Full LongRAG pipeline: split → index → retrieve → generate answer
 */
async function longRAGPipeline(
  documents: { id: string; content: string }[],
  query: string,
) {
  // Step 1: Section splitting
  const allSections = documents.flatMap((doc) => splitBySection(doc, 8000, 500));

  // Step 2: Retrieval (assuming sections already have embeddings)
  const retrieval = await retrieveSections(query, allSections, {
    topK: 4, maxContextTokens: 40000, minRelevanceScore: 0.25,
  });

  // Step 3: Assemble context and generate answer
  const context = retrieval.sections
    .map((s, i) => `=== Source ${i + 1}: ${s.title} ===\n${s.content}`)
    .join('\n---\n\n');

  const answer = await callLLM({
    model: 'claude-sonnet-4-20250514',
    system: `Answer the question based on the reference materials. Cite source numbers, and clearly state when information is insufficient.`,
    messages: [{ role: 'user', content: `Reference materials:\n${context}\n\nQuestion: ${query}` }],
    maxTokens: 1000,
  });

  return {
    answer,
    sources: retrieval.sections.map((s) => ({ documentId: s.documentId, title: s.title })),
    tokensUsed: retrieval.totalTokens,
  };
}

Use Cases

LongRAG isn’t a silver bullet. Here are the scenarios where it particularly excels and where it doesn’t.

Particularly Well-Suited

1. Lengthy Legal Documents

Legal texts are characterized by extensive cross-references between clauses, where the meaning of one clause depends on the context of others. Traditional RAG breaks all these cross-references when it splits contracts into small chunks. LongRAG preserves entire sections or even entire contracts, letting the LLM see the complete relationships between clauses.

User asks: "Can the owner terminate the contract if the contractor delays delivery?"

Traditional RAG might only find:
  "If delivery is delayed by more than 30 days, the owner has the right to terminate the contract."

LongRAG would find the entire "Termination Clause" section, including:
  - Definition of delivery delay
  - 30-day grace period
  - Force majeure exceptions
  - Written notice requirement before termination
  - Post-termination settlement procedures

2. Academic Papers

A paper’s methodology, experimental design, and results analysis are typically spread across different sections but are closely related. LongRAG can retrieve the entire “Method + Experiments” segment at once, letting the LLM understand the causal relationship between methods and results.

3. Technical Manuals and API Documentation

Technical concepts typically require complete context to understand. An API endpoint’s behavior may depend on authentication settings, rate limit policies, error code definitions, and other information scattered across different sections. Large chunks make it easier to retrieve all this dispersed information at once.

4. Multi-Hop Reasoning Queries

When an answer requires synthesizing information from multiple paragraphs, LongRAG has a natural advantage:

Question: "Under what circumstances can the company not pay year-end bonuses?"

Requires synthesizing:
  - Year-end bonus calculation rules (Chapter 4)
  - Employee evaluation criteria (Chapter 7)
  - Special exception clauses (Chapter 12)

LongRAG is more likely to retrieve all three chapters.

5. Scenarios Where Context Coherence Matters More Than Precision

Customer service knowledge bases, product FAQs, policy manuals — in these scenarios, providing users with a complete, coherent answer matters more than precisely citing a specific paragraph. LongRAG’s large context enables the LLM to generate smoother, more complete responses.

Less Suitable

1. Precise Fact Lookups in Very Large Corpora

If your corpus contains millions of documents and users are simply looking for a specific number or date, traditional RAG’s small chunks + precise retrieval is more efficient. LongRAG would consume a large amount of unnecessary tokens in this scenario.

2. Low-Latency Requirements

LongRAG feeds 5–10x more tokens to the LLM than traditional RAG, with inference latency increasing proportionally. For scenarios requiring millisecond-level responses (such as real-time search suggestions), this may be unacceptable.

3. Cost-Sensitive High-Frequency Queries

Consuming 15K–50K input tokens per query, with tens of thousands of queries per day, makes token costs very significant.


Limitations and Challenges

1. Requires Long-Context LLMs

LongRAG’s prerequisite is that the LLM can handle a large number of input tokens. If your model only has an 8K–16K context window, LongRAG’s large chunks simply won’t fit.

Currently, models supporting 100K+ context are still relatively few, and most are paid APIs (Claude, Gemini, GPT-4). Locally deployed open-source models are mostly still in the 8K–32K range.

2. Token Costs Scale Linearly

More input tokens directly means higher API costs. Simple calculation:

Assuming 10,000 queries per day:

Traditional RAG:
  10,000 × 3,000 tokens × $3/1M = $90/day

LongRAG:
  10,000 × 25,000 tokens × $3/1M = $750/day

Annualized difference: $32,850 vs $273,750

This cost gap is impossible to ignore at high traffic volumes.

3. Increased Inference Latency

LLM inference time is roughly proportional to input tokens. Processing 25K tokens takes approximately 5–8x longer than 3K tokens. In user-experience-sensitive scenarios (such as chatbots), this latency may be unacceptable.

Mitigation strategies:

  • Use streaming responses so the time-to-first-token remains unchanged
  • Cache results for popular queries
  • Fall back to traditional RAG for simple queries that don’t need long context

4. Embedding Quality Degrades with Text Length

Existing embedding models (e.g., text-embedding-3-large) produce lower-quality semantic representations when processing long text. A 5,000-token passage may cover multiple topics, and a single embedding vector struggles to capture all of them simultaneously.

Mitigation strategies:

  • Use summary embeddings (the approach described earlier)
  • Use ColBERT-style multi-vector representations
  • Use multiple embeddings to represent a single section

5. The “Lost in the Middle” Problem

Research shows that LLMs pay weaker attention to middle portions of long contexts (relative to the beginning and end). If critical information happens to be in the middle of a long context, the LLM may overlook it.

Mitigation strategies:

  • Place the most relevant sections at the beginning and end of the context
  • Explicitly remind the LLM in the prompt to attend to all sections
  • Limit total context length — don’t stuff content in without limits

6. Lack of Standardized Evaluation Benchmarks

Traditional RAG has mature evaluation frameworks (Precision@K, Recall@K, MRR, etc.), but evaluating LongRAG is more difficult. Because retrieval unit sizes differ, directly comparing Precision@K is unfair. There are currently no standardized evaluation benchmarks for LongRAG scenarios.


Summary

LongRAG’s core insight is simple: When the LLM’s comprehension ability is strong enough and the context window is large enough, you don’t need to put all the pressure on retrieval.

Traditional RAG’s design was reasonable in the era of small context windows: precise splitting, precise retrieval, giving the LLM only the most essential information. But the cost of this strategy is information fragmentation and over-reliance on retrieval precision.

LongRAG redistributes this pressure:

  • Retrieval: Shifts from “precisely find the most relevant small fragments” to “roughly find relevant large segments”
  • Comprehension: Shifts from “piece together answers from fragmented context” to “understand and answer within complete context”

This isn’t meant to replace traditional RAG, but rather provides another effective design choice now that long-context models are widespread. Choose the most suitable strategy based on your document characteristics, query types, cost budget, and latency requirements.

The most pragmatic approach is likely a hybrid strategy: use traditional RAG for simple queries to save tokens, and switch to LongRAG for complex queries to improve quality. A single query classifier can make this happen.


Further Reading

References