#latency-optimization

1 post
AI guide

Speculative RAG: Small Models Draft in Parallel, Large Model Verifies at Once

Speculative RAG uses small specialist models to generate multiple answer drafts in parallel, each from a different subset of the retrieved documents; a large model then verifies the drafts and selects the best answer in a single pass. Reported gains are an accuracy improvement of up to 12.97% and a latency reduction of up to 50.83%.
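
The draft-then-verify flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: `make_subsets`, `speculative_rag`, `toy_drafter`, and `toy_verifier` are hypothetical names, the round-robin split stands in for whatever document-subset strategy is actually used, and the toy drafter and verifier are placeholders for real calls to a small and a large model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Draft = Tuple[str, str]  # (answer, rationale)


def make_subsets(docs: List[str], num_subsets: int) -> List[List[str]]:
    """Round-robin split (an assumption) so each drafter sees different documents."""
    subsets = [docs[i::num_subsets] for i in range(num_subsets)]
    return [s for s in subsets if s]  # drop empty subsets


def speculative_rag(
    question: str,
    docs: List[str],
    drafter: Callable[[str, List[str]], Draft],
    verifier: Callable[[str, List[Draft]], List[float]],
    num_subsets: int = 3,
) -> str:
    subsets = make_subsets(docs, num_subsets)
    if not subsets:
        raise ValueError("no documents to draft from")

    # Step 1: the small model drafts one answer per subset, in parallel.
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        drafts = list(pool.map(lambda s: drafter(question, s), subsets))

    # Step 2: the large model scores all drafts in a single call.
    scores = verifier(question, drafts)

    # Step 3: return the answer of the highest-scoring draft.
    best = max(range(len(drafts)), key=lambda i: scores[i])
    return drafts[best][0]


# Placeholder models (hypothetical), used only to make the sketch runnable.
def toy_drafter(question: str, subset: List[str]) -> Draft:
    return subset[0], " ".join(subset)


def toy_verifier(question: str, drafts: List[Draft]) -> List[float]:
    q_words = set(question.lower().split())
    return [float(len(q_words & set(r.lower().split()))) for _, r in drafts]


if __name__ == "__main__":
    documents = [
        "Paris is the capital of France",
        "Bananas are yellow",
        "Berlin is in Germany",
    ]
    print(speculative_rag("What is the capital of France", documents,
                          toy_drafter, toy_verifier))
```

The latency benefit in this sketch comes from two places: the drafts run concurrently on short contexts, and the large model is called once over compact drafts rather than over every retrieved document.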