Speculative RAG: Small Models Draft in Parallel, Large Model Verifies at Once

TL;DR Speculative RAG uses small specialist models to generate multiple answer drafts from different document subsets in parallel, then a large model verifies and selects the best answer in one pass. Accuracy improves up to 12.97%, latency drops up to 50.83%.

#rag #speculative-rag #dual-model #latency-optimization #accuracy

Table of Contents

Three Bottlenecks of Standard RAG
Speculative RAG Architecture
1. Overall Flow
2. Why Is This Faster?
RAG Drafter Design
RAG Verifier Design
Performance Data
Comparison with Other RAG Patterns
Implementation Guide
Applicable Scenarios and Limitations
Future Outlook
References

🌏 中文版

The standard RAG pipeline is familiar to everyone: retrieve documents, assemble them into context, and send them to an LLM for generation. This pipeline is simple and effective, but it has a fundamental bottleneck: all documents are stuffed into a single LLM call, the model must process all information in one generation pass, and the entire pipeline is sequential.

The more documents, the longer the context, the higher the latency — and models tend to “get lost” in super-long contexts, with important information buried in massive walls of text. This is the problem Speculative RAG aims to solve.

Three Bottlenecks of Standard RAG

1. Latency from Sequential Processing

Standard RAG follows a strictly sequential pipeline:

Query → Retrieve → [Concatenate all docs] → LLM generation → Answer
                                              ↑
                                    Single call, long wait

Document retrieval can be fast (millisecond-level), but LLM generation is the bottleneck of the entire pipeline. The longer the context, the longer the generation time. Ten documents concatenated together could easily reach 8,000 tokens, and a large model takes significant time to process that length.

2. Attention Dilution in Long Context

When all retrieved documents are packed into a single prompt, the model must simultaneously process multiple pieces of potentially contradictory information. Research has repeatedly shown that LLMs suffer from a “lost in the middle” problem in long contexts: information at the beginning and end is remembered, while the middle is ignored.

If the most relevant document happens to be ranked in the middle, the model may not fully utilize it at all.

3. The Single-Generation Gamble

Standard RAG generates only one answer. If the generation goes off course — picking up wrong information from documents, or following a flawed reasoning chain — there’s no chance for correction. Unless you add an Agentic RAG loop-retry mechanism, but that introduces even more latency.

Speculative RAG’s core insight: Rather than having one large model struggle through all documents, have multiple small models each handle a small subset, then let a large model pick the best answer from multiple candidates.

Speculative RAG Architecture

The name borrows from the concept of Speculative Decoding: use small models to “speculate,” use the large model to “verify.”

Overall Flow

                          ┌─────────────────┐
                          │     Query        │
                          └────────┬────────┘
                                   │
                          ┌────────▼────────┐
                          │    Retriever     │
                          │ (Retrieve N docs)│
                          └────────┬────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
              ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
              │ Subset 1  │ │ Subset 2  │ │ Subset 3  │
              │ {D1, D3}  │ │ {D2, D5}  │ │ {D1, D4}  │
              └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
                    │              │              │
              ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
              │  Drafter   │ │  Drafter   │ │  Drafter   │
              │(Small Model)│ │(Small Model)│ │(Small Model)│
              │  Draft 1   │ │  Draft 2   │ │  Draft 3   │
              │+ Rationale │ │+ Rationale │ │+ Rationale │
              └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
                    │              │              │
                    │      ┌──────┴──────┐       │
                    └──────►             ◄───────┘
                           │  Verifier   │
                           │(Large Model)│
                           │Score+Select │
                           └──────┬──────┘
                                  │
                          ┌───────▼───────┐
                          │  Best Answer   │
                          └───────────────┘

Three key steps:

Document grouping: Split the N retrieved documents into K subsets (randomly or strategically), each containing a portion of the documents
Parallel draft generation: K small RAG Drafter models process their respective document subsets in parallel, each generating an answer draft and reasoning process
Single-pass verification: One large RAG Verifier model receives all drafts, scores them in one pass, and selects the best answer

Why Is This Faster?

The key is parallelization. Three small models run simultaneously, each processing only 2–3 documents (short context), generating quickly. The large model only needs to review a few drafts (not all raw documents), so the context is short and verification is fast.

Overall latency ≈ max(Drafter latency) + Verifier latency

Compared to standard RAG:

Overall latency ≈ Large model processing all documents latency

The former is typically significantly shorter because the small models are fast, the context is short, and execution is parallel.

RAG Drafter Design

The RAG Drafter is the “workforce” of the architecture — small, specialized, and parallelizable.

Model Selection

The paper uses Mistral-7B-Instruct as the Drafter model, with specialized training for RAG tasks. Reasons for choosing a small model:

Fast inference: A 7B-parameter model has much lower inference latency on GPUs compared to 70B+ large models
Parallel deployment: The same GPU memory can run multiple small model instances
Task specialization: Drafters don’t need broad world knowledge — they only need to extract and organize information from given documents

Each Drafter Sees a Different Document Subset

This is the most clever design of Speculative RAG. Suppose the retriever returns 6 documents {D1, D2, D3, D4, D5, D6}, the system creates multiple subsets:

Drafter 1 receives: {D1, D3, D5}
Drafter 2 receives: {D2, D4, D6}
Drafter 3 receives: {D1, D2, D4}
Drafter 4 receives: {D3, D5, D6}

Each Drafter sees only a portion of the documents, which brings several benefits:

Shorter context per Drafter: 3 documents vs. 6 documents means more focused attention
Diversity from different subsets: Different document combinations may lead to different answer perspectives
Redundancy for fault tolerance: Even if one subset’s document quality is poor, other subsets may still contain the information needed for the correct answer

Subset Sampling Strategy

The paper’s sampling approach: for each Drafter, randomly sample a subset from the N retrieved documents. Subsets can overlap (the same document may appear in multiple subsets), which increases the chance that important documents get utilized.

Subset size is a hyperparameter. Too small (1 document) may lack sufficient information; too large (close to N documents) loses the advantage of distributed processing. The paper’s experiments typically use 2–3 documents as the subset size.

Rationale Generation

Each Drafter generates not just an answer but also a rationale (reasoning process). The Drafter’s output format:

Draft: [Answer content]
Rationale: [Reasoning process for arriving at this answer from these documents]

The rationale serves as a basis for the Verifier’s judgment. The Verifier doesn’t just evaluate whether the answer is correct — it also assesses whether the reasoning is sound. One answer might happen to be correct but have flawed reasoning, while another might be less complete but have rigorous logic — the Verifier can make better judgments based on this.

Drafter Training

The paper trains the Drafter using Knowledge Distillation:

Use a large model (e.g., GPT-4) to generate high-quality (query, documents, draft, rationale) training data for RAG tasks
Fine-tune a small model (Mistral-7B-Instruct) on this data
The fine-tuned small model learns the ability to “read documents → write draft + reasoning”

This training approach allows a 7B small model to approach large model quality on specific tasks while maintaining the speed advantage of a small model.

RAG Verifier Design

The RAG Verifier is the “judge” of the architecture — large, general-purpose, and decisive in a single pass.

Model Selection

The paper uses GPT-4 or similar large general-purpose models as the Verifier. Reasons for choosing a large model:

Broad world knowledge: Can cross-verify draft accuracy
Strong reasoning ability: Can evaluate logical consistency of reasoning chains
Comparative judgment: Can review multiple candidate answers simultaneously and select the best one

Verification Flow

The Verifier receives the following input:

Query: [Original question]

Draft 1:
Answer: [Draft 1's answer]
Rationale: [Draft 1's reasoning process]

Draft 2:
Answer: [Draft 2's answer]
Rationale: [Draft 2's reasoning process]

Draft 3:
Answer: [Draft 3's answer]
Rationale: [Draft 3's reasoning process]

Please evaluate each draft and select the best answer.

Note that the Verifier does not see the original documents. It only sees the Drafters’ drafts and reasoning processes. This is an intentional design choice:

Shorter context: 3 drafts are far shorter than 6 raw documents
Pre-processed information: Drafters have already extracted relevant information from the raw documents
Focused comparison: The Verifier’s task is to compare and judge, not extract information from scratch

Scoring Mechanism

The Verifier scores each draft across multiple dimensions:

Factual Accuracy: Is the answer consistent with the information cited in the reasoning?
Reasoning Completeness: Is the reasoning chain complete? Are there logical leaps?
Query Relevance: Does the answer directly address the question?
Self-Consistency: Are the answer and rationale consistent with each other?

Ultimately, the Verifier outputs an overall score or directly selects the best draft. In the paper’s implementation, conditional probability is used for scoring:

Score(draft_k) = P_verifier(draft_k | query, all_drafts)

That is, given all drafts, the probability of the Verifier model generating each draft’s tokens. Higher probability means the Verifier “agrees” more with that draft.

Why Doesn’t the Verifier Need the Original Documents?

This is a counterintuitive design choice. We might wonder: how can the Verifier judge whether a draft is accurate without seeing the original documents?

The answer is: the Verifier relies on the large model’s own world knowledge and reasoning ability.

If a draft claims “the highest mountain in Taiwan is Yushan at 3,952 meters elevation,” the Verifier’s large model already knows whether this is correct
If a draft’s reasoning has logical contradictions (stating A earlier then concluding not A later), the Verifier can detect this through reasoning ability
If multiple drafts give different answers, the Verifier can cross-reference to determine which is more credible

This design keeps the Verifier’s context window short, further reducing latency.

Performance Data

The paper tested Speculative RAG on five benchmarks, comparing it with standard RAG and other methods. Here are the main results.

Accuracy Comparison

Benchmark	Standard RAG	Self-RAG	CRAG	Speculative RAG	vs Standard
TriviaQA	68.27%	70.41%	69.83%	73.59%	+5.32%
MuSiQue	25.14%	27.30%	26.71%	30.48%	+5.34%
PopQA	43.87%	49.06%	47.22%	56.84%	+12.97%
PubHealth	72.40%	74.10%	73.50%	76.20%	+3.80%
ARC-Challenge	78.95%	81.23%	80.56%	84.17%	+5.22%

Key observations:

PopQA shows the largest improvement (+12.97%): PopQA is a long-tail knowledge QA benchmark with many questions about uncommon entities. Speculative RAG’s multi-subset strategy gives different Drafters the opportunity to find answers from different angles
MuSiQue also shows significant improvement (+5.34%): MuSiQue is a multi-hop reasoning task requiring the combination of information from multiple documents. Different Drafters seeing different document combinations increases the chance of “accidentally” combining the right information
Outperforms Self-RAG and CRAG across the board: Speculative RAG beats both methods on all benchmarks

Latency Comparison

Benchmark	Standard RAG Latency	Speculative RAG Latency	Latency Reduction
TriviaQA	12.4s	6.1s	-50.81%
MuSiQue	14.8s	7.8s	-47.30%
PopQA	11.2s	5.5s	-50.89%
PubHealth	13.6s	7.2s	-47.06%
ARC-Challenge	10.8s	5.8s	-46.30%

Sources of latency reduction:

Parallel Drafter execution: Multiple small models run simultaneously — total latency depends on the slowest one (not the sum)
Shorter context: Each Drafter processes only 2–3 documents; the Verifier processes only a few drafts
Fewer large model calls: Standard RAG has the large model process all documents; Speculative RAG only uses the large model for verification (shorter context)

Pareto Improvement in Accuracy vs. Latency

This is Speculative RAG’s strongest point: it simultaneously improves both accuracy and latency.

Typically, we face this tradeoff:

Want more accuracy? Use a larger model, process more documents → higher latency
Want more speed? Use a smaller model, process fewer documents → lower accuracy

Speculative RAG breaks this tradeoff. Through architectural design (rather than simply throwing hardware at the problem), it achieves improvements in both dimensions. This is a Pareto Improvement.

Comparison with Other RAG Patterns

vs Standard RAG

Standard RAG:
  Query → Retrieve ALL docs → [Big Model] → Answer
  Latency: High (large model processes long context)
  Accuracy: Medium (attention dilution)

Speculative RAG:
  Query → Retrieve docs → Split into subsets
        → [Small Model 1] → Draft 1 ─┐
        → [Small Model 2] → Draft 2 ──┤→ [Big Model] → Best Answer
        → [Small Model 3] → Draft 3 ─┘
  Latency: Low (parallel + short context)
  Accuracy: High (multiple perspectives + verification)

Core difference: Standard RAG is “one model does everything,” Speculative RAG is “division of labor.”

vs Self-RAG

Self-RAG has the model self-reflect during generation, deciding whether more retrieval is needed. Its problems:

Sequential reflection: Reflect-retrieve-regenerate is a sequential process; each reflection adds latency
One model wearing multiple hats: The same model handles both generation and reflection — task conflict
Single perspective: Always starting from the same document set

Speculative RAG’s advantages:

Parallel, not sequential: Multiple Drafters work simultaneously
Dedicated roles: Drafters focus on generation, Verifier focuses on verification
Diverse perspectives: Different document subsets bring different angles

Dimension	Self-RAG	Speculative RAG
Architecture	Single model + reflection tokens	Dual model (Drafter + Verifier)
Execution	Sequential (generate → reflect → regenerate)	Parallel (multi-Drafter) + single verification
Latency	High (multiple iterations)	Low (parallel + short context)
Diversity	Low (same perspective for reflection)	High (different document subsets)
Training Cost	Requires special token training	Requires Drafter distillation training

vs CRAG (Corrective RAG)

CRAG’s core approach is “retrieval quality detection + correction”: if retrieval results are poor, correct the query and re-retrieve.

The two solve different problems:

CRAG addresses “poor retrieval quality”: Retrieved documents are irrelevant, requiring query correction
Speculative RAG addresses “poor generation quality”: Documents have already been retrieved; the issue is how to better utilize them

They are actually complementary. You can use CRAG first to ensure retrieval quality, then Speculative RAG to ensure generation quality:

Query → CRAG (ensure retrieval quality) → Speculative RAG (ensure generation quality) → Answer

Dimension	CRAG	Speculative RAG
Goal	Improve retrieval quality	Improve generation quality
Correction target	Query / retrieval results	Answer drafts
Additional cost	Multiple retrievals	Multiple Drafter inferences
Model requirement	Single model	Dual model (large + small)
Composability	Can combine with Speculative RAG	Can combine with CRAG

Applicability Overview of Each Method

                    Retrieval Quality
                    ↑
          High ┃  Standard RAG       Speculative RAG
               ┃  (Good enough)      (Need more accuracy+speed)
               ┃
          Low  ┃  CRAG               CRAG + Speculative RAG
               ┃  (Fix retrieval     (Need to fix both)
               ┃   first)
               ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━→
                    Low              High
                       Generation Complexity

Implementation Guide

Below is a TypeScript implementation example demonstrating the Speculative RAG dual-model pattern.

Core Type Definitions

interface Document {
  id: string;
  content: string;
  score: number; // Retrieval relevance score
}

interface Draft {
  answer: string;
  rationale: string;
  sourceDocIds: string[];
  drafterId: number;
}

interface VerificationResult {
  selectedDraft: Draft;
  scores: Map<number, number>; // drafterId → score
  confidence: number;
}

interface SpeculativeRAGConfig {
  numDrafters: number;        // Number of Drafters (default 3-5)
  subsetSize: number;         // Documents per subset (default 2-3)
  drafterModel: string;       // Small model ID
  verifierModel: string;      // Large model ID
  maxDrafterTokens: number;   // Max Drafter output tokens
  maxVerifierTokens: number;  // Max Verifier output tokens
}

Document Subset Sampling

function sampleDocumentSubsets(
  documents: Document[],
  numSubsets: number,
  subsetSize: number,
): Document[][] {
  const subsets: Document[][] = [];

  for (let i = 0; i < numSubsets; i++) {
    // Weighted random sampling: higher relevance scores increase selection probability
    const subset = weightedSample(documents, subsetSize);
    subsets.push(subset);
  }

  return subsets;
}

function weightedSample(
  documents: Document[],
  size: number,
): Document[] {
  const totalScore = documents.reduce((sum, doc) => sum + doc.score, 0);
  const selected: Document[] = [];
  const remaining = [...documents];

  for (let i = 0; i < Math.min(size, remaining.length); i++) {
    // Weighted sampling based on retrieval scores
    const weights = remaining.map((doc) => doc.score / totalScore);
    const idx = weightedRandomIndex(weights);
    selected.push(remaining[idx]);
    remaining.splice(idx, 1);
  }

  return selected;
}

function weightedRandomIndex(weights: number[]): number {
  const r = Math.random();
  let cumulative = 0;
  for (let i = 0; i < weights.length; i++) {
    cumulative += weights[i];
    if (r <= cumulative) return i;
  }
  return weights.length - 1;
}

RAG Drafter Implementation

async function generateDraft(
  query: string,
  documents: Document[],
  drafterId: number,
  config: SpeculativeRAGConfig,
): Promise<Draft> {
  const docContext = documents
    .map((doc, i) => `[Document ${i + 1}] (ID: ${doc.id})\n${doc.content}`)
    .join('\n\n');

  const prompt = `You are a RAG specialist. Given the following documents, answer the query.
You MUST also provide your reasoning process (rationale).

Query: ${query}

Documents:
${docContext}

Respond in this exact format:
Answer: [Your answer based on the documents]
Rationale: [Step-by-step reasoning for how you arrived at this answer from the documents]`;

  const response = await callLLM({
    model: config.drafterModel,
    prompt,
    maxTokens: config.maxDrafterTokens,
    temperature: 0.7, // Slightly higher temperature for diversity
  });

  const { answer, rationale } = parseDraftResponse(response);

  return {
    answer,
    rationale,
    sourceDocIds: documents.map((d) => d.id),
    drafterId,
  };
}

function parseDraftResponse(
  response: string,
): { answer: string; rationale: string } {
  const answerMatch = response.match(/Answer:\s*([\s\S]*?)(?=Rationale:)/i);
  const rationaleMatch = response.match(/Rationale:\s*([\s\S]*)/i);

  return {
    answer: answerMatch?.[1]?.trim() ?? response,
    rationale: rationaleMatch?.[1]?.trim() ?? 'No rationale provided',
  };
}

RAG Verifier Implementation

async function verifyDrafts(
  query: string,
  drafts: Draft[],
  config: SpeculativeRAGConfig,
): Promise<VerificationResult> {
  const draftsContext = drafts
    .map(
      (draft, i) =>
        `[Draft ${i + 1}] (Drafter #${draft.drafterId})
Answer: ${draft.answer}
Rationale: ${draft.rationale}
Source Documents: ${draft.sourceDocIds.join(', ')}`,
    )
    .join('\n\n---\n\n');

  const prompt = `You are an expert answer verifier. Given a query and multiple draft answers,
evaluate each draft and select the best one.

Evaluation criteria:
1. Factual Accuracy: Is the answer factually correct?
2. Reasoning Quality: Is the rationale logical and complete?
3. Query Relevance: Does the answer directly address the query?
4. Self-Consistency: Are the answer and rationale consistent?

Query: ${query}

Drafts:
${draftsContext}

Respond in this exact format:
Selected: [draft number]
Confidence: [0.0-1.0]
Scores: [draft1_score, draft2_score, ...]
Justification: [Why you selected this draft]`;

  const response = await callLLM({
    model: config.verifierModel,
    prompt,
    maxTokens: config.maxVerifierTokens,
    temperature: 0.0, // Low temperature for verification to ensure consistency
  });

  return parseVerificationResponse(response, drafts);
}

function parseVerificationResponse(
  response: string,
  drafts: Draft[],
): VerificationResult {
  const selectedMatch = response.match(/Selected:\s*(\d+)/i);
  const confidenceMatch = response.match(/Confidence:\s*([\d.]+)/i);
  const scoresMatch = response.match(/Scores:\s*\[([\d.,\s]+)\]/i);

  const selectedIdx = (parseInt(selectedMatch?.[1] ?? '1') - 1);
  const confidence = parseFloat(confidenceMatch?.[1] ?? '0.5');
  const scoreValues = scoresMatch?.[1]?.split(',').map((s) => parseFloat(s.trim())) ?? [];

  const scores = new Map<number, number>();
  drafts.forEach((draft, i) => {
    scores.set(draft.drafterId, scoreValues[i] ?? 0);
  });

  return {
    selectedDraft: drafts[selectedIdx] ?? drafts[0],
    scores,
    confidence,
  };
}

Complete Pipeline

async function speculativeRAG(
  query: string,
  config: SpeculativeRAGConfig,
): Promise<{
  answer: string;
  confidence: number;
  selectedDrafterId: number;
  allDrafts: Draft[];
  verification: VerificationResult;
}> {
  // Step 1: Retrieve documents
  const documents = await retrieve(query);

  // Step 2: Create document subsets
  const subsets = sampleDocumentSubsets(
    documents,
    config.numDrafters,
    config.subsetSize,
  );

  // Step 3: Generate drafts in parallel (critical!)
  const draftPromises = subsets.map((subset, i) =>
    generateDraft(query, subset, i, config),
  );
  const drafts = await Promise.all(draftPromises);

  // Step 4: Large model verification
  const verification = await verifyDrafts(query, drafts, config);

  return {
    answer: verification.selectedDraft.answer,
    confidence: verification.confidence,
    selectedDrafterId: verification.selectedDraft.drafterId,
    allDrafts: drafts,
    verification,
  };
}

Usage Example

const config: SpeculativeRAGConfig = {
  numDrafters: 4,
  subsetSize: 2,
  drafterModel: 'mistral-7b-instruct',   // Small model
  verifierModel: 'gpt-4',                 // Large model
  maxDrafterTokens: 512,
  maxVerifierTokens: 256,
};

const result = await speculativeRAG(
  'Which crag in Taiwan is best for beginner climbers?',
  config,
);

console.log(`Answer: ${result.answer}`);
console.log(`Confidence: ${result.confidence}`);
console.log(`Selected Drafter: #${result.selectedDrafterId}`);
console.log(`All drafts: ${result.allDrafts.length}`);

Implementation Notes

Drafter Model Selection

You don’t have to use the paper’s Mistral-7B. Any small model that supports instruction following works:

Mistral 7B / Mixtral 8x7B
Llama 3.1 8B
Phi-3 Mini (3.8B)
Gemma 2 9B

The key is that the model must be small enough for parallel deployment yet smart enough to extract information from documents.

Verifier Model Selection

Large general-purpose models work best:

GPT-4 / GPT-4o
Claude 3.5 Sonnet / Claude 3 Opus
Gemini 1.5 Pro

The Verifier needs broad knowledge and strong judgment, not speed.

Hyperparameter Tuning

// Conservative config (low latency, suitable for simple questions)
const conservativeConfig: SpeculativeRAGConfig = {
  numDrafters: 3,
  subsetSize: 2,
  drafterModel: 'phi-3-mini',
  verifierModel: 'gpt-4o-mini',
  maxDrafterTokens: 256,
  maxVerifierTokens: 128,
};

// Aggressive config (high accuracy, suitable for complex questions)
const aggressiveConfig: SpeculativeRAGConfig = {
  numDrafters: 5,
  subsetSize: 3,
  drafterModel: 'mistral-7b-instruct',
  verifierModel: 'gpt-4',
  maxDrafterTokens: 1024,
  maxVerifierTokens: 512,
};

Applicable Scenarios and Limitations

Scenarios Where Speculative RAG Shines

1. Latency-Sensitive Knowledge QA

If your application has strict response time requirements (e.g., customer service chatbots, real-time search engines), Speculative RAG can significantly reduce latency without sacrificing accuracy.

2. Large and Diverse Document Collections

When the retriever returns many documents (10+) covering different aspects, Speculative RAG’s subset distribution strategy is particularly effective. Different Drafters seeing different subsets makes it easier to capture different facets of information.

3. Scenarios Requiring High Accuracy

Medical QA (PubHealth benchmark), scientific reasoning (ARC-Challenge), and other scenarios requiring high accuracy. The diversity of multiple Drafters plus the Verifier’s strict validation is more reliable than single-pass generation.

4. GPU Resources Available for Parallel Inference

Speculative RAG needs to run multiple Drafter instances simultaneously. If you have sufficient GPU resources (or use APIs that support batch inference), this architecture can leverage the parallelization advantage.

Scenarios Where It’s Not a Good Fit

1. Simple Factual Queries

For questions like “What is the capital of Taiwan?” standard RAG or even just letting the LLM answer directly works fine. Multi-Drafter verification for such questions is over-engineering and wastes resources.

2. Limited GPU Resources

If you have only one GPU or your API doesn’t support batch/concurrent calls, the parallelization advantage of Drafters is lost. Running 4 Drafters + 1 Verifier sequentially would actually have higher latency than standard RAG.

3. Consistently High-Quality Documents

If your knowledge base is high quality and documents don’t contradict each other, standard RAG’s single-pass generation is usually good enough. Speculative RAG’s diversity advantage is not significant in this case.

4. Scenarios Requiring Real-Time Streaming

Speculative RAG requires waiting for all Drafters to finish plus Verifier validation before outputting. If your application needs token-by-token streaming (e.g., ChatGPT-style progressive display), this architecture requires additional modifications.

One possible streaming approach: stream the highest-confidence Drafter’s draft first while running the Verifier in the background. If the Verifier selects a different draft, swap the display. But this adds UX complexity.

5. Very Few Documents

If only 2–3 documents are retrieved, splitting them into multiple subsets makes little sense (each subset might have only 1 document). In this case, standard RAG processing directly is sufficient.

Cost Considerations

Speculative RAG has a different cost structure from standard RAG:

Item	Standard RAG	Speculative RAG
Large model calls	1 (long context)	1 (short context)
Small model calls	0	K (parallel)
Large model input tokens	Many (all documents)	Few (only drafts)
Small model input tokens	0	K x subset document tokens
Total token cost	Medium	Medium-high
GPU requirement	Low	Medium-high

In terms of token cost, Speculative RAG may be slightly higher (due to K additional Drafter calls), but the large model’s input tokens are reduced (drafts vs. raw documents). If the price gap between large and small models is significant (e.g., GPT-4 vs. Mistral-7B), the total cost may break even or even be lower.

In terms of latency cost, Speculative RAG is clearly lower, which is highly valuable in latency-sensitive scenarios.

Future Outlook

Speculative RAG was published in July 2024 and accepted by ICLR in 2025. The architecture’s core idea — division of labor and parallelization — is likely to be more broadly applied in other LLM pipelines.

Several possible directions:

Adaptive Drafter count: Dynamically adjust the number of Drafters based on question complexity. Use 2 for simple questions, 5 for complex ones.
Intelligent subset allocation: Instead of random document subset assignment, use strategic grouping based on document topic or type.
Drafter specialization: Different Drafters specialize in different question types (factual, reasoning, comparative), with routing based on question type.
Combination with other RAG techniques: CRAG + Speculative RAG, Graph RAG + Speculative RAG, and other combinations.

References

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting (Wang et al., ICLR 2025)
Fast Inference from Transformers via Speculative Decoding (Leviathan et al., ICML 2023) — The original Speculative Decoding paper, conceptual source for Speculative RAG’s dual-model design
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., ICLR 2024) — Self-RAG comparison method
Corrective Retrieval Augmented Generation (Yan et al., AAAI 2024) — CRAG comparison method, complementary to Speculative RAG
A Survey on Retrieval-Augmented Generation for Large Language Models (2024) — Full landscape of RAG systems, covering the background of the Drafter-Verifier dual-model architecture

Speculative RAG: Small Models Draft in Parallel, Large Model Verifies at Once

Three Bottlenecks of Standard RAG

1. Latency from Sequential Processing

2. Attention Dilution in Long Context

3. The Single-Generation Gamble

Speculative RAG Architecture

Overall Flow

Why Is This Faster?

RAG Drafter Design

Model Selection

Each Drafter Sees a Different Document Subset

Subset Sampling Strategy

Rationale Generation

Drafter Training

RAG Verifier Design

Model Selection

Verification Flow

Scoring Mechanism

Why Doesn’t the Verifier Need the Original Documents?

Performance Data

Accuracy Comparison

Latency Comparison

Pareto Improvement in Accuracy vs. Latency

Comparison with Other RAG Patterns

vs Standard RAG

vs Self-RAG

vs CRAG (Corrective RAG)

Applicability Overview of Each Method

Implementation Guide

Core Type Definitions

Document Subset Sampling

RAG Drafter Implementation

RAG Verifier Implementation

Complete Pipeline

Usage Example

Implementation Notes

Applicable Scenarios and Limitations

Scenarios Where Speculative RAG Shines

Scenarios Where It’s Not a Good Fit

Cost Considerations

Future Outlook

References

Related · #rag