🌏 中文版
The core assumption of traditional RAG is that LLM context windows are limited, so we must split documents into small pieces and only feed the most relevant fragments into the model.
This assumption was reasonable before 2024. But now Gemini has 1M tokens, Claude has 200K tokens, and GPT-4o has 128K tokens. When context windows explode from 4K to over 100K, RAG’s design logic should evolve accordingly.
LongRAG is the embodiment of this thinking: Stop splitting documents into fragments — retrieve larger units and let long-context models do the understanding.
The Problem with Small Chunks
Traditional RAG typically splits documents into small chunks of 256–512 tokens. This approach seems reasonable, but in practice it creates a series of problems.
Information Fragmentation
A complete argument gets split across multiple chunks, with each chunk containing only partial information.
Consider this example:
Original paragraph:
"Article 12 of the contract stipulates that Party A must complete payment
within 30 business days of receiving the acceptance report. If Party A
delays payment, a penalty of 0.05% per day shall be applied. However,
if the delay is caused by Party B's failure to provide complete acceptance
documents, Party A bears no liability for the delay."
Split with a 128-token chunk size:
Chunk 1: "Article 12 of the contract stipulates that Party A must complete payment within 30 business days of receiving the acceptance report."
Chunk 2: "If Party A delays payment, a penalty of 0.05% per day shall be applied."
Chunk 3: "However, if the delay is caused by Party B's failure to provide complete acceptance documents, Party A bears no liability for the delay."
The user asks: “How much penalty does Party A pay for late payment?”
The retrieval system might only find Chunk 2. But the correct answer requires Chunk 2 + Chunk 3 — because there’s an exception clause. If only Chunk 2 is retrieved, the LLM will produce an answer that appears correct but is incomplete.
This is information fragmentation: each chunk’s semantics are incomplete and must be combined with other chunks to reconstruct the full meaning.
Boundary Context Loss
Chunk boundaries are often arbitrary. A chain of reasoning that crosses a boundary gets broken:
End of Chunk A: "...therefore, we adopted the Transformer architecture. Specifically,"
Start of Chunk B: "we used a 6-layer encoder with Rotary Position Encoding (RoPE),"
Chunk B lacks the context of “why we adopted Transformer.” If the user asks “Why was this architecture chosen?”, Chunk B alone cannot answer the question.
Over-reliance on Retrieval Precision
Small chunks mean a large number of candidate fragments. A 100,000-character document split with 512-token chunks produces roughly 400 chunks. The retrieval system must precisely find the 3–5 most relevant ones from 400 candidates.
This places extremely high demands on retrieval:
- Embeddings must accurately capture the semantics of each small chunk
- Ranking must be precise, because the difference between top-3 and top-10 could mean the difference between “has the answer” and “doesn’t have the answer”
- Multi-hop reasoning (answers scattered across multiple chunks) is nearly impossible to handle well
Small chunks transfer all the “understanding” pressure onto “retrieval.” And retrieval is never perfect.
Uneven Semantic Density
Different paragraphs have vastly different semantic densities. A 512-token legal clause might contain 5 important points, while a 512-token background introduction might contain only 1. Fixed-size chunks cannot reflect this variation.
The Opportunity with Long-Context Models
Between 2024 and 2025, mainstream LLMs experienced explosive growth in context window sizes:
| Model | Context Window | Release Date |
|---|---|---|
| GPT-3.5 | 4K tokens | 2023-03 |
| Claude 2 | 100K tokens | 2023-07 |
| GPT-4 Turbo | 128K tokens | 2023-11 |
| Gemini 1.5 Pro | 1M tokens | 2024-02 |
| Claude 3.5 Sonnet | 200K tokens | 2024-06 |
| Gemini 2.0 | 1M tokens | 2025-01 |
This changes the fundamental tradeoff in RAG:
Before: Context windows are scarce resources → must retrieve precisely → small chunks → high retrieval pressure.
Now: Context windows are abundant → can include more content → large chunks → shift pressure from retrieval to comprehension.
Key Insight
Long-context models excel at finding relevant information within large amounts of text. Research shows that even within a 100K-token context, good models can accurately locate specific facts embedded within it (needle-in-a-haystack tests).
This means: We don’t need perfect retrieval — just good-enough retrieval. Throw roughly relevant content at the LLM and let it find the answer itself.
From Precise Retrieval to Coarse-Grained Retrieval
Traditional RAG mindset: “Only give the LLM the most relevant fragments; don’t waste tokens.”
LongRAG mindset: “Give the LLM enough context and let it decide what’s relevant.”
This isn’t a regression — it’s leveraging advances in model capability. Rather than investing heavily in perfecting retrieval (better embeddings, more precise reranking, smarter query expansion), leverage the model’s own comprehension ability.
LongRAG Architecture
LongRAG’s core design is simple: increase retrieval granularity.
Traditional RAG vs LongRAG
Traditional RAG:
┌──────────┐ ┌──────────────────────────────────────┐
│ │ │ Document A │
│ Query │ │ ┌─────┐┌─────┐┌─────┐┌─────┐... │
│ │ │ │ c1 ││ c2 ││ c3 ││ c4 │ │
│ │ │ │512t ││512t ││512t ││512t │ │
│ │ │ └─────┘└─────┘└─────┘└─────┘ │
└────┬─────┘ └──────────────────────────────────────┘
│
│ Vector search From hundreds of chunks
│ (top-k=5) precisely find 5
▼
┌─────┐┌─────┐┌─────┐┌─────┐┌─────┐
│ c2 ││ c17 ││ c43 ││ c8 ││ c91 │ ← May miss key fragments
└─────┘└─────┘└─────┘└─────┘└─────┘
│
▼ Feed to LLM (~2,500 tokens)
┌──────────────────────────────┐
│ LLM answers from fragmented │
│ context │
└──────────────────────────────┘
LongRAG:
┌──────────┐ ┌──────────────────────────────────────┐
│ │ │ Document A │
│ Query │ │ ┌──────────────┐┌──────────────┐ │
│ │ │ │ Section 1 ││ Section 2 │ │
│ │ │ │ (~6,000t) ││ (~8,000t) │ │
│ │ │ └──────────────┘└──────────────┘ │
└────┬─────┘ └──────────────────────────────────────┘
│
│ Vector search From a small number of
│ (top-k=3) segments, find roughly
▼ relevant ones
┌──────────────┐┌──────────────┐┌──────────────┐
│ Section 2 ││ Section 5 ││ Section 11 │ ← Complete semantic units
└──────────────┘└──────────────┘└──────────────┘
│
▼ Feed to LLM (~20,000 tokens)
┌──────────────────────────────┐
│ LLM answers from complete │
│ context │
└──────────────────────────────┘
The fundamental differences:
- Traditional RAG: Hundreds of small chunks → precise retrieval → fragmented context → LLM pieces together an answer
- LongRAG: Dozens of large segments → coarse-grained retrieval → complete context → LLM directly comprehends
Choosing Retrieval Units
LongRAG’s “large chunks” aren’t just small chunks stitched together arbitrarily. It uses the document’s own structure as splitting boundaries:
| Retrieval Unit | Size Range | Use Case |
|---|---|---|
| Paragraph groups | 1,000–3,000 tokens | Documents with unclear structure |
| Sections | 3,000–10,000 tokens | Documents with heading structure |
| Sub-documents | 10,000–30,000 tokens | Independent chapters in long documents |
| Entire documents | 30,000–100,000 tokens | Short to medium-length standalone documents |
Key principle: Splitting boundaries should align with semantic boundaries, not fixed token counts.
Redesigning the Chunking Strategy
LongRAG isn’t about “not chunking” — it’s about “chunking smarter.” Here are three main strategies.
Strategy 1: Document-Level Retrieval
The most extreme approach: each document is a single retrieval unit.
Index structure:
Document A → one embedding (representing the entire document's semantics)
Document B → one embedding
Document C → one embedding
...
Retrieval: find the 1–3 most relevant documents, feed full text to LLM.
Pros:
- No chunking needed at all
- Zero information fragmentation
- Simplest implementation
Cons:
- A single document may exceed the context window
- One embedding struggles to represent all topics in a long document
- Lowest retrieval precision (document-level is too coarse)
Best for:
- Document collections where each document is relatively short (< 30K tokens)
- Documents with a single topic that don’t cover multiple unrelated subjects
- Small document collections (< 1,000 documents)
Strategy 2: Section-Level Retrieval
Split by the document’s section structure, with each section as a retrieval unit.
Index structure:
Document A / Section 1 → one embedding
Document A / Section 2 → one embedding
Document A / Section 3 → one embedding
Document B / Section 1 → one embedding
...
Retrieval: find the 2–5 most relevant sections, combine and feed to LLM.
Pros:
- Preserves semantic completeness (sections are typically complete units of discourse)
- Moderate retrieval granularity, more precise than document-level
- Leverages the document’s own structure — no manual judgment needed for split points
Cons:
- Requires documents with clear section structure (headings, table of contents, etc.)
- Section sizes can vary dramatically (some 500 tokens, others 20,000 tokens)
- Cross-section references are still lost
Best for:
- Technical documentation, academic papers, legal documents
- Documents with Markdown/HTML heading structure
- Documents with multiple independently understandable topics
Strategy 3: Hybrid Strategy (Coarse Retrieval + Fine Reading)
This is LongRAG’s recommended approach: a two-stage architecture.
Stage 1 - Coarse-grained retrieval:
Use section/document-level embeddings to find roughly relevant segments
Stage 2 - Fine-grained reading:
LLM finds precise answers within the large segments
Equivalent to:
Traditional RAG's "retriever + reader"
But the retriever is coarser, and the reader is stronger
Process:
- During indexing, split at the section level (3,000–10,000 tokens)
- During retrieval, take top-k=3–5 sections (roughly 15,000–50,000 tokens total)
- Combine these sections into a single large context and feed to a long-context LLM
- The LLM finds precise answers within the large context
Pros:
- Combines high recall from coarse retrieval with precise comprehension from the LLM
- Doesn’t require perfect retrieval (the LLM filters out irrelevant content for you)
- Highly adaptable — different documents can use different granularities
Strategy Comparison
| Dimension | Document-Level | Section-Level | Hybrid Strategy |
|---|---|---|---|
| Splitting granularity | Entire document | 3K–10K tokens | 3K–10K tokens |
| Index size | Smallest | Medium | Medium |
| Retrieval precision | Low | Medium | Medium + LLM compensation |
| Context completeness | Highest | High | High |
| Token consumption | Highest | Medium-high | Medium-high |
| Implementation complexity | Lowest | Medium | Medium-high |
| Required model | 1M context | 100K+ context | 100K+ context |
Retrieval Efficiency Considerations
Large chunks change the performance characteristics of the retrieval system.
Dramatically Fewer Candidates
For the same corpus:
Traditional RAG (512 tokens/chunk):
100K-character document × 100 documents = ~80,000 chunks
Vector search space: 80,000 vectors
LongRAG (section-level, ~5,000 tokens/section):
100K-character document × 100 documents = ~8,000 sections
Vector search space: 8,000 vectors
A 10x reduction in candidates directly provides:
- Faster vector search: ANN algorithms like HNSW are faster on smaller indexes
- Lower storage costs: Fewer vectors = less memory and disk space
- Simpler index maintenance: Fewer vectors to update when adding/removing documents
Precision vs. Recall Tradeoff
Large chunks naturally have higher recall (each chunk covers more content), but precision may decrease (more irrelevant content within each chunk).
Small chunks (512 tokens):
✓ High precision — each chunk is topic-focused
✗ Low recall — answer may be split across adjacent chunks
✗ Requires more precise retrieval
Large chunks (5,000 tokens):
✗ Lower precision — chunks may contain irrelevant content
✓ High recall — answer is more likely to be within a selected chunk
✓ High fault tolerance — can find answers even with imperfect retrieval
Balancing Strategies
Several methods to reduce precision loss with large chunks:
1. Multi-level Indexing
Maintain both section-level and paragraph-level embeddings simultaneously. Use section-level retrieval to narrow the scope first, then use paragraph-level for ranking.
2. Summary Embeddings
Instead of embedding the entire section text, first use an LLM to generate a section summary, then embed the summary. Summaries are more condensed with higher semantic density, resulting in better embedding quality.
3. Multi-vector Representation
Generate multiple embeddings for a single section (e.g., section summary + first sentence of each paragraph within the section). A hit on any of them counts as that section being relevant.
4. Post-retrieval Reranking
After retrieving large chunks, use a cross-encoder or LLM to perform secondary ranking on paragraphs within the chunk, prioritizing the most relevant portions.
Comparison with Traditional RAG
Here’s a detailed comparison between LongRAG and traditional RAG across multiple dimensions:
| Dimension | Traditional RAG | LongRAG |
|---|---|---|
| Chunk size | 256–512 tokens | 3,000–100,000 tokens |
| Vectors in index | Many (hundreds per document) | Few (single digits to dozens per document) |
| Retrieval precision | High (each chunk is topic-focused) | Medium (mixed topics within chunks) |
| Context coherence | Low (fragmented) | High (complete semantic units) |
| Token consumption/query | Low (~2K–5K tokens) | High (~15K–50K tokens) |
| Inference latency | Lower | Higher (more tokens to process) |
| Retrieval latency | Higher (large index) | Lower (small index) |
| Multi-hop reasoning | Weak (requires multiple retrievals) | Strong (large context naturally covers it) |
| Dependence on retrieval quality | Extremely high | Moderate |
| LLM requirements | Any model | Requires long-context model |
| Cost per query | Lower | Higher |
| Build complexity | High (needs fine-grained chunking + reranking) | Low (coarse-grained splitting suffices) |
Concrete Performance Data
Based on the LongRAG paper and related research results:
NQ (Natural Questions) dataset:
- Traditional RAG (100-word chunks, top-5): Answer hit rate ~52%
- LongRAG (4K+ token groups, top-4): Answer hit rate ~71%
- Improvement: ~19 percentage points
The key reason is improved recall. Traditional RAG’s top-5 small chunks frequently miss the fragment containing the answer, while LongRAG’s large chunks are more likely to encompass it.
Cost Analysis
LongRAG trades more tokens for better answer quality:
Traditional RAG per query:
Retrieval: 5 chunks × 512 tokens = 2,560 input tokens
Generation: ~200 output tokens
Cost (Claude 3.5): ~$0.008
LongRAG per query:
Retrieval: 3 sections × 6,000 tokens = 18,000 input tokens
Generation: ~200 output tokens
Cost (Claude 3.5): ~$0.055
Cost increase of approximately 7x, but with significantly improved answer quality.
Whether this tradeoff is worthwhile depends on the use case. For high-value queries in legal, medical, or financial domains, paying 7x more for more accurate and complete answers is reasonable. For low-value everyday Q&A, traditional RAG may be more economical.
Implementation Guide
Here’s a complete LongRAG retrieval pipeline implementation in TypeScript.
Section-Level Splitter
interface Section {
id: string;
documentId: string;
title: string;
content: string;
tokenCount: number;
embedding: number[];
metadata: { level: number; position: number };
}
/**
* Split documents by section structure, not fixed token counts.
* Sections exceeding maxTokens are recursively split at the next heading level;
* adjacent sections that are too small are automatically merged.
*/
function splitBySection(
document: { id: string; content: string },
maxTokens = 8000,
minTokens = 500,
): Section[] {
const lines = document.content.split('\n');
const sections: Section[] = [];
let currentTitle = '';
let currentContent: string[] = [];
let level = 1;
let idx = 0;
function flush() {
if (!currentContent.length) return;
const content = currentContent.join('\n');
const tokens = estimateTokens(content);
sections.push({
id: `${document.id}_s${idx}`,
documentId: document.id,
title: currentTitle,
content,
tokenCount: tokens,
embedding: [],
metadata: { level, position: idx++ },
});
currentContent = [];
}
for (const line of lines) {
const m = line.match(/^(#{1,6})\s+(.+)/);
if (m) {
flush();
level = m[1].length;
currentTitle = m[2].trim();
currentContent = [line];
} else {
currentContent.push(line);
}
}
flush();
// Merge adjacent sections that are too small
return sections.reduce<Section[]>((merged, section) => {
const prev = merged[merged.length - 1];
if (prev && prev.tokenCount < minTokens) {
prev.content += '\n\n' + section.content;
prev.tokenCount += section.tokenCount;
} else {
merged.push({ ...section });
}
return merged;
}, []);
}
function estimateTokens(text: string): number {
const zh = (text.match(/[一-鿿]/g) || []).length;
const en = text.replace(/[一-鿿]/g, '').split(/\s+/).length;
return Math.ceil(zh * 1.5 + en * 1.3);
}
LongRAG Retrieval Pipeline
import { cosineSimilarity, generateEmbedding } from './utils';
interface RetrievalResult {
sections: Section[];
totalTokens: number;
query: string;
}
interface LongRAGConfig {
topK: number; // How many sections to retrieve
maxContextTokens: number; // Maximum context tokens
minRelevanceScore: number; // Minimum relevance score
}
/**
* Core retrieval logic for LongRAG.
* Key differences from traditional RAG:
* 1. Retrieval units are sections (thousands of tokens), not small chunks (hundreds of tokens)
* 2. top-k is smaller (3-5), because each result is already large
* 3. Has token budget control to avoid exceeding the LLM's context window
*/
async function retrieveSections(
query: string,
index: Section[], // Each section already has an embedding
config: LongRAGConfig = { topK: 5, maxContextTokens: 50000, minRelevanceScore: 0.3 },
): Promise<RetrievalResult> {
const queryEmbedding = await generateEmbedding(query);
// Compute similarity → filter → sort
const scored = index
.map((section) => ({
section,
score: cosineSimilarity(queryEmbedding, section.embedding),
}))
.filter((item) => item.score >= config.minRelevanceScore)
.sort((a, b) => b.score - a.score);
// Select top-k sections within token budget
const selected: Section[] = [];
let totalTokens = 0;
for (const item of scored) {
if (selected.length >= config.topK) break;
if (totalTokens + item.section.tokenCount > config.maxContextTokens) continue;
selected.push(item.section);
totalTokens += item.section.tokenCount;
}
// Sort by original position (preserve reading order)
selected.sort((a, b) => {
if (a.documentId !== b.documentId) return a.documentId.localeCompare(b.documentId);
return a.metadata.position - b.metadata.position;
});
return { sections: selected, totalTokens, query };
}
Complete Pipeline: Retrieval + Generation
/**
* Full LongRAG pipeline: split → index → retrieve → generate answer
*/
async function longRAGPipeline(
documents: { id: string; content: string }[],
query: string,
) {
// Step 1: Section splitting
const allSections = documents.flatMap((doc) => splitBySection(doc, 8000, 500));
// Step 2: Retrieval (assuming sections already have embeddings)
const retrieval = await retrieveSections(query, allSections, {
topK: 4, maxContextTokens: 40000, minRelevanceScore: 0.25,
});
// Step 3: Assemble context and generate answer
const context = retrieval.sections
.map((s, i) => `=== Source ${i + 1}: ${s.title} ===\n${s.content}`)
.join('\n---\n\n');
const answer = await callLLM({
model: 'claude-sonnet-4-20250514',
system: `Answer the question based on the reference materials. Cite source numbers, and clearly state when information is insufficient.`,
messages: [{ role: 'user', content: `Reference materials:\n${context}\n\nQuestion: ${query}` }],
maxTokens: 1000,
});
return {
answer,
sources: retrieval.sections.map((s) => ({ documentId: s.documentId, title: s.title })),
tokensUsed: retrieval.totalTokens,
};
}
Use Cases
LongRAG isn’t a silver bullet. Here are the scenarios where it particularly excels and where it doesn’t.
Particularly Well-Suited
1. Lengthy Legal Documents
Legal texts are characterized by extensive cross-references between clauses, where the meaning of one clause depends on the context of others. Traditional RAG breaks all these cross-references when it splits contracts into small chunks. LongRAG preserves entire sections or even entire contracts, letting the LLM see the complete relationships between clauses.
User asks: "Can the owner terminate the contract if the contractor delays delivery?"
Traditional RAG might only find:
"If delivery is delayed by more than 30 days, the owner has the right to terminate the contract."
LongRAG would find the entire "Termination Clause" section, including:
- Definition of delivery delay
- 30-day grace period
- Force majeure exceptions
- Written notice requirement before termination
- Post-termination settlement procedures
2. Academic Papers
A paper’s methodology, experimental design, and results analysis are typically spread across different sections but are closely related. LongRAG can retrieve the entire “Method + Experiments” segment at once, letting the LLM understand the causal relationship between methods and results.
3. Technical Manuals and API Documentation
Technical concepts typically require complete context to understand. An API endpoint’s behavior may depend on authentication settings, rate limit policies, error code definitions, and other information scattered across different sections. Large chunks make it easier to retrieve all this dispersed information at once.
4. Multi-Hop Reasoning Queries
When an answer requires synthesizing information from multiple paragraphs, LongRAG has a natural advantage:
Question: "Under what circumstances can the company not pay year-end bonuses?"
Requires synthesizing:
- Year-end bonus calculation rules (Chapter 4)
- Employee evaluation criteria (Chapter 7)
- Special exception clauses (Chapter 12)
LongRAG is more likely to retrieve all three chapters.
5. Scenarios Where Context Coherence Matters More Than Precision
Customer service knowledge bases, product FAQs, policy manuals — in these scenarios, providing users with a complete, coherent answer matters more than precisely citing a specific paragraph. LongRAG’s large context enables the LLM to generate smoother, more complete responses.
Less Suitable
1. Precise Fact Lookups in Very Large Corpora
If your corpus contains millions of documents and users are simply looking for a specific number or date, traditional RAG’s small chunks + precise retrieval is more efficient. LongRAG would consume a large amount of unnecessary tokens in this scenario.
2. Low-Latency Requirements
LongRAG feeds 5–10x more tokens to the LLM than traditional RAG, with inference latency increasing proportionally. For scenarios requiring millisecond-level responses (such as real-time search suggestions), this may be unacceptable.
3. Cost-Sensitive High-Frequency Queries
Consuming 15K–50K input tokens per query, with tens of thousands of queries per day, makes token costs very significant.
Limitations and Challenges
1. Requires Long-Context LLMs
LongRAG’s prerequisite is that the LLM can handle a large number of input tokens. If your model only has an 8K–16K context window, LongRAG’s large chunks simply won’t fit.
Currently, models supporting 100K+ context are still relatively few, and most are paid APIs (Claude, Gemini, GPT-4). Locally deployed open-source models are mostly still in the 8K–32K range.
2. Token Costs Scale Linearly
More input tokens directly means higher API costs. Simple calculation:
Assuming 10,000 queries per day:
Traditional RAG:
10,000 × 3,000 tokens × $3/1M = $90/day
LongRAG:
10,000 × 25,000 tokens × $3/1M = $750/day
Annualized difference: $32,850 vs $273,750
This cost gap is impossible to ignore at high traffic volumes.
3. Increased Inference Latency
LLM inference time is roughly proportional to input tokens. Processing 25K tokens takes approximately 5–8x longer than 3K tokens. In user-experience-sensitive scenarios (such as chatbots), this latency may be unacceptable.
Mitigation strategies:
- Use streaming responses so the time-to-first-token remains unchanged
- Cache results for popular queries
- Fall back to traditional RAG for simple queries that don’t need long context
4. Embedding Quality Degrades with Text Length
Existing embedding models (e.g., text-embedding-3-large) produce lower-quality semantic representations when processing long text. A 5,000-token passage may cover multiple topics, and a single embedding vector struggles to capture all of them simultaneously.
Mitigation strategies:
- Use summary embeddings (the approach described earlier)
- Use ColBERT-style multi-vector representations
- Use multiple embeddings to represent a single section
5. The “Lost in the Middle” Problem
Research shows that LLMs pay weaker attention to middle portions of long contexts (relative to the beginning and end). If critical information happens to be in the middle of a long context, the LLM may overlook it.
Mitigation strategies:
- Place the most relevant sections at the beginning and end of the context
- Explicitly remind the LLM in the prompt to attend to all sections
- Limit total context length — don’t stuff content in without limits
6. Lack of Standardized Evaluation Benchmarks
Traditional RAG has mature evaluation frameworks (Precision@K, Recall@K, MRR, etc.), but evaluating LongRAG is more difficult. Because retrieval unit sizes differ, directly comparing Precision@K is unfair. There are currently no standardized evaluation benchmarks for LongRAG scenarios.
Summary
LongRAG’s core insight is simple: When the LLM’s comprehension ability is strong enough and the context window is large enough, you don’t need to put all the pressure on retrieval.
Traditional RAG’s design was reasonable in the era of small context windows: precise splitting, precise retrieval, giving the LLM only the most essential information. But the cost of this strategy is information fragmentation and over-reliance on retrieval precision.
LongRAG redistributes this pressure:
- Retrieval: Shifts from “precisely find the most relevant small fragments” to “roughly find relevant large segments”
- Comprehension: Shifts from “piece together answers from fragmented context” to “understand and answer within complete context”
This isn’t meant to replace traditional RAG, but rather provides another effective design choice now that long-context models are widespread. Choose the most suitable strategy based on your document characteristics, query types, cost budget, and latency requirements.
The most pragmatic approach is likely a hybrid strategy: use traditional RAG for simple queries to save tokens, and switch to LongRAG for complex queries to improve quality. A single query classifier can make this happen.
Further Reading
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs — The original LongRAG paper
- Chunking Strategies: How Splitting Methods Determine Whether RAG Can Find Answers — Detailed comparison of traditional chunking strategies
- Contextual Retrieval: Giving Each Chunk Its Own Context — Anthropic’s alternative approach to solving fragmentation
- Cross-Encoder Reranking: Using Precision Ranking Models to Compensate for Coarse Retrieval — Remediation strategy when retrieval isn’t precise enough
References
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs — Jiang et al. (2024), the original LongRAG paper proposing the complete framework of large-chunk retrieval combined with long-context models
- GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models — Li et al. (2024, EMNLP), graph-structure agent system for processing long documents, surpassing GPT-4-128K with a 4K window
- Retrieval-Augmented Generation for Large Language Models: A Survey — Gao et al. (2024), comprehensive analysis of RAG evolution and design tradeoffs across generations
- Searching for Best Practices in Retrieval-Augmented Generation — Wang et al. (2024), systematic experiments on the relationship between chunking strategies and RAG performance
- Anthropic — Effective Context Engineering for AI Agents — Anthropic engineering blog on compaction and compression strategies for long-context management
- Multi-Head RAG: Solving Multi-Aspect Problems with LLMs — Besta et al. (2024), innovative approach using multi-head attention as retrieval keys, complementary to LongRAG
- Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG — Singh et al. (2025), analysis of Agentic RAG applications in long-document scenarios
Loading...