#rag

66 posts
ai deep-dive

How to Rigorously Compare Before and After Agent Changes: From Golden Sets to Statistical Testing

Even with temperature=0, LLM outputs can still fluctuate by up to 15% in practice. To rigorously compare agent changes, you need a frozen golden set, at least 3 runs per query averaged out, LLM-as-judge blind evaluation (pairwise preference flip rate reaches 35%), and paired statistical tests -- not just running each version once and going by feel.
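
A minimal sketch of that comparison loop, assuming each golden-set query already has per-run judge scores in [0, 1]. The function names, the sample scores, and the choice of a sign-flip permutation test as the paired test are all illustrative assumptions, not the post's own implementation.

```python
import random
from statistics import mean

def per_query_means(runs):
    """Average the repeated runs for each query (runs: list of lists of scores)."""
    return [mean(scores) for scores in runs]

def paired_permutation_test(before, after, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-query differences.

    Returns (mean_difference, p_value). Assumes before[i] and after[i]
    are scores for the same golden-set query.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same queries")
    diffs = [a - b for a, b in zip(after, before)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical scores: 3 runs per query, the same 4 queries for both versions.
before_runs = [[0.6, 0.7, 0.65], [0.4, 0.5, 0.45], [0.8, 0.8, 0.8], [0.3, 0.2, 0.25]]
after_runs = [[0.7, 0.8, 0.75], [0.5, 0.6, 0.55], [0.9, 0.9, 0.9], [0.4, 0.3, 0.35]]

diff, p = paired_permutation_test(
    per_query_means(before_runs), per_query_means(after_runs)
)
```

With only four queries the test cannot reach a small p-value even when every query improves, which is one reason a golden set needs enough queries before a paired test says anything useful.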

ai deep-dive

How Agents Decide Whether to Retrieve, What to Retrieve, and How to Merge: Three Decision Layers of Agentic RAG

Traditional RAG is a fixed 'retrieve then answer' pipeline. Agentic RAG splits retrieval into three decision layers: when to retrieve (FLARE uses token probabilities; Adaptive-RAG uses a complexity classifier), what to retrieve (HyDE / RAG-Fusion / decomposition / Step-back), and how to fuse (RRF with k=60, then cross-encoder rerank, then compression; Anthropic measured a 67% reduction in retrieval failure rate). The key counter-intuitive insight: unnecessary retrieval hurts quality, so 'deciding not to retrieve' is a first-class capability.

ai deep-dive

Semantic Similarity ≠ Retrieval Relevance: Scenarios, Detection, and Remedies for Systematic Embedding Retrieval Failures

Cosine similarity and relevance diverge systematically across an entire class of scenarios: negation (most IR models score at or below random on NevIR), exact identifiers, numeric thresholds, and logical combinations (SoTA models achieve recall@100 below 20% on LIMIT). Some of these hit the theoretical ceiling of the single-vector paradigm, so switching to a larger model will not help. Recommended remedy order: hybrid BM25 -> reranker (Anthropic measured a 67% drop in retrieval failures) -> upstream metadata routing -> domain fine-tuning / multi-vector.

ai deep-dive

A More Expensive Embedding Won't Save Your Traditional Chinese RAG: Three Layers of Failure and the Fix Order

Traditional Chinese RAG retrieval failures stack in three layers: embedding granularity defects (BGE/GTE models from 0.1B to 7B all mis-rank simple queries like 'fried chicken'); Simplified Chinese and English corpus dominance, which causes local vocabulary drift (alignment for terms like 'premium' and 'exclusion clause' is unreliable); and MTEB Chinese benchmarks built on Simplified Chinese, which makes model selection signals misleading. The fix is architectural: OpenCC normalization -> hybrid + jieba segmentation -> reranker -> local fine-tuning last. The prerequisite for all of it is building a Traditional Chinese eval set first.

ai deep-dive

Auto-Embedding on File Upload Is a Bad Default: A Survey of Adaptive / Agentic RAG and Agentic Parsing

Making 'chunk and embed every uploaded file automatically' the default behavior means making a decision for the LLM that it could have made itself. From Self-RAG (2310.11511) and Adaptive-RAG (2403.14403) to AgenticOCR (2602.24134), the academic trajectory is pushing three layers of decision-making -- whether to retrieve, whether to parse, and how to chunk -- from the ingestion pipeline back to the agent at conversation time.

tech debug

LLM Agent Tool Descriptions Determine Tool Selection: Three Bug Fixes

Rewriting tool descriptions from soft suggestions to hard rules (whitelist + consequence explanation) eliminated the LLM's incorrect tool selection; adding skip_signal=True fixed vector store double-indexing.

ai

Claude for Financial Services: Dissecting Anthropic's Multi-Agent Reference Implementation

Anthropic open-sourced 12 financial-industry Agents and 11 MCP connectors. The real takeaway isn't the Agents themselves but the layered design of 'one prompt, two runtimes' and 'pure-file extensibility.'

ai deep-dive

DeepSeek-OCR: The 10x Compression Experiment That Turns Long Context into Images

DeepSeek-OCR's paper is titled Contexts Optical Compression -- OCR is just the means; what it actually validates is that 'rendering text as images and feeding them to a VLM' achieves 10x compression at 97% accuracy. This is a qualitative shift for long-context LLM and RAG token costs.

ai

Local Deep Research Walkthrough: A Privacy-First Deep Research Agent

Local Deep Research is a privacy-first deep research agent built on LangChain + LangGraph, integrating 20+ search engines and 30+ research strategies. Its flagship langgraph_agent_strategy takes the LLM-autonomous tool-calling approach, offering a fundamentally different paradigm from fixed-pipeline RAG graphs.

PageIndex: RAG Without Vectors — Turning Long Documents Into a Book With a Table of Contents

PageIndex skips chunking, embedding, and vector storage entirely. Instead it relies on LLM reasoning over a tree-structured table of contents the LLM itself wrote, achieving 98.7% on FinanceBench (GPT-4o reading directly scores only 31%). It solves a different problem than vector RAG — finding the right section in a well-structured long document.

ai guide RAG Systems in Practice

Building a Legal Contract RAG in 36 Hours: Weaviate Query Agent + ColQwen Architecture Breakdown

Using Weaviate Query Agent + ColQwen multi-vector model, a single prompt built a production-grade legal contract search system in 36 hours -- this post breaks down its architecture logic, technology choices, and what you actually need to watch out for.

ai guide

knowledge-pipeline: A Six-Layer Pipeline for RAG Quality Control

A six-layer deterministic pipeline that handles everything from URL ingestion to vector embedding automatically, filtering out garbage before it enters your RAG system through an eight-dimension scoring system.

ai guide

MarkItDown: Convert Any File to Markdown Before Feeding It to an LLM

A lightweight open-source tool from Microsoft that converts PDF, Office, images, audio, and more into Markdown — purpose-built for LLM pipelines.

product project

quidproquo Blog Improvement Roadmap: Content, Technical Debt, RAG Design, and Harness Infrastructure

Using my own 30+ RAG/Agent posts to audit the blog itself, I identified a prioritized improvement list spanning content quality, site tech, RAG design fixes, harness infrastructure, and AI agent applications — no phases, just priorities.

tech guide

The Full Picture of Cloudflare Workers AI Binding: It's More Than Just run()

env.AI is not just run(). It also exposes toMarkdown (document-to-Markdown conversion), autorag (managed RAG), gateway (external provider proxy), and models (metadata lookup). Understanding these four method groups is what unlocks Cloudflare as a full AI platform inside Workers.

ai guide

Three Modes of LLM Knowledge Bases: Knowledge Vault, Experience Vault, and Blog

Andrej Karpathy proposed a framework for compiling personal knowledge wikis with LLMs — collect raw data, have the LLM compile it into .md wiki pages, run Q&A against the wiki, and file outputs back. This post compares three practical approaches: Karpathy's knowledge vault model, the community's experience vault model, and quidproquo's blog model.

ai guide

AI-Ready Content: The Complete Guide to Making Your Website an AI-Readable Data Source

In 2025-2026, websites need to be readable not just by humans but by AI. From llms.txt and Schema Markup to GEO and RAG ingestion pipelines, this post maps out the complete technical landscape for turning your website into an AI-consumable data source.

tech guide

"Recommend the next route" and "Recommend something similar" are not the same thing — Intent Disambiguation in RAG Recommendation Systems

In a climbing RAG system, 'recommend the next route' (progression) and 'recommend a similar route' (similarity) were conflated by a single hasSimilarRouteIntent() function, causing recommendation quality to collapse. The fix is a two-stage intent classification with a Regex Fast Path + LLM Fallback.
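
A rough sketch of the two-stage idea, assuming English queries and two intents. The regex patterns, the `classify_intent` name, and the `fake_llm` stand-in are hypothetical; the post's actual patterns and model call are not shown in this summary.

```python
import re

# Hypothetical patterns; a real system would tune these on logged queries.
PROGRESSION = re.compile(r"\b(next|harder|step up|progress)\b", re.IGNORECASE)
SIMILARITY = re.compile(r"\b(similar|like|same (grade|style))\b", re.IGNORECASE)

def classify_intent(query, llm_fallback):
    """Stage 1: regex fast path. Stage 2: defer to an LLM when rules are unsure.

    llm_fallback is any callable taking the query and returning
    'progression' or 'similarity'; it stands in for a real model call.
    Returns (intent, source) where source is 'regex' or 'llm'.
    """
    wants_next = bool(PROGRESSION.search(query))
    wants_similar = bool(SIMILARITY.search(query))
    if wants_next and not wants_similar:
        return "progression", "regex"
    if wants_similar and not wants_next:
        return "similarity", "regex"
    # Both or neither matched: the rules are ambiguous, so ask the model.
    return llm_fallback(query), "llm"

def fake_llm(query):
    """Stand-in for an LLM classifier so the sketch runs offline."""
    return "progression"

a = classify_intent("Recommend the next route after this one", fake_llm)
b = classify_intent("Recommend something similar", fake_llm)
c = classify_intent("Something similar but a bit harder next time", fake_llm)
```

Returning the source alongside the intent makes it easy to log how often the fast path resolves a query and how often the fallback is needed.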

tech guide

RAG Multi-Entity Queries: When the User Lists Five Routes and the System Only Sees the First

The RAG system's extractRouteReference() used a for...return pattern that grabbed only the first match — so when a user provided five completed routes, only one was used. The fix evolves through three layers: rule-based multi-entity extraction, user profile aggregation, and embedding centroid.
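
To illustrate the first layer, here is a small Python analogue of the return-inside-loop bug and a rule-based multi-entity fix. The function names and the route catalog are hypothetical (only 'Beauty in the Mirror' appears elsewhere on this page), and the original `extractRouteReference()` may differ in detail.

```python
import re

# Hypothetical route catalog; a real system would load this from its database.
KNOWN_ROUTES = ["Beauty in the Mirror", "Dragon Cave", "Sea Breeze", "Golden Valley"]

def extract_first_route(query):
    """The bug pattern: returning inside the loop stops at the first hit."""
    for name in KNOWN_ROUTES:
        if re.search(re.escape(name), query, re.IGNORECASE):
            return name
    return None

def extract_all_routes(query):
    """Rule-based multi-entity extraction: collect every match, in query order."""
    hits = []
    for name in KNOWN_ROUTES:
        match = re.search(re.escape(name), query, re.IGNORECASE)
        if match:
            hits.append((match.start(), name))
    return [name for _, name in sorted(hits)]

query = "I've sent Sea Breeze, Dragon Cave and Golden Valley. What next?"
first_only = extract_first_route(query)  # a single route, chosen by catalog order
all_routes = extract_all_routes(query)   # every mentioned route, in query order
```

The list returned by `extract_all_routes` is what the later layers (profile aggregation, embedding centroid) would consume.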

tech deep-dive

When Vector Search Matches by Name Instead of Grade: Attribute Conflation in RAG Systems

Query: 'I just sent Beauty in the Mirror 5.11b — recommend routes of similar difficulty.' The results came back full of routes with similar-sounding names, not similar grades. Root cause: dense embeddings compress multiple attributes into a single vector, and the rarity of the route name drowns out the grade signal. The fix: three layers of defense — metadata pre-filtering, query rewriting, and score fusion.

ai guide

LangGraph: Managing Agent Workflows with Graph Structures

LangGraph models LLM workflows as directed graphs, solving the pain points of multi-turn iteration, conditional branching, and parallel execution that are difficult to handle with linear pipelines.

ai guide AI Agent in Practice

Context Engineering: Why Your AI Agent's Problem Is Information, Not the Model

Context Engineering is the core concept that replaced Prompt Engineering in 2025: the focus shifted from 'how to ask' to 'what information to provide.' Delivering the right information at the right time into the context window is more effective than upgrading to a stronger model. This post covers the definition, four key strategies, practical techniques, and common failure modes.

ai guide

Agent Memory Systems: From RAG to Read-Write Memory Evolution

RAG is read-only. Agent Memory lets AI not only read but also write and persist information. Three memory types, Procedural (behavior patterns), Episodic (temporal events), and Semantic (factual knowledge), together form a complete cognitive memory system.

ai guide RAG Systems in Practice

Multi-Agent RAG: Distributed Retrieval Architecture with Specialized Agent Collaboration

A single RAG Agent handling all queries hits knowledge boundaries and performance bottlenecks. Multi-Agent RAG dispatches retrieval tasks to multiple specialized Agents, each with its own knowledge base and retrieval strategy, coordinated by a central Orchestrator that merges results.

ai guide

LongRAG: Rethinking RAG Chunking Strategy with Long-Context Models

Traditional RAG splits documents into small chunks for retrieval, but this causes information fragmentation. LongRAG leverages 100K+ token long-context models to retrieve larger document segments (entire sections or even whole documents), reducing fragmentation while maintaining retrieval efficiency.

ai guide

Speculative RAG: Small Models Draft in Parallel, Large Model Verifies at Once

Speculative RAG uses small specialist models to generate multiple answer drafts from different document subsets in parallel, then a large model verifies and selects the best answer in one pass. Accuracy improves by up to 12.97% and latency drops by up to 50.83%.

ai guide RAG Systems in Practice

The Complete Guide to RAG System Patterns: A Ten-Generation Evolution from Naive to Multi-Agent with Practical Navigation

RAG has evolved far beyond simple 'search + generate' into a technology ecosystem spanning ten generations. This article is a systematic navigation guide: from Naive RAG to Multi-Agent RAG across ten generations, covering retrieval strategies, chunking, embedding, reranking, evaluation frameworks, observability, and cost optimization. Each topic has a dedicated deep-dive article.

ai guide

Agentic RAG: Letting the LLM Decide When to Search Again

For complex multi-hop questions, a single RAG search isn't enough. Agentic RAG lets the LLM evaluate whether retrieved results are sufficient — if not, it rewrites the query and searches again, forming a ReAct loop.

ai guide

BGE-M3: Why This Embedding Model Works Well for Traditional Chinese RAG

Your choice of embedding model directly determines RAG search quality. BGE-M3's multilingual training, 1024-dimensional vectors, and matching Reranker make it a practical pick for Traditional Chinese RAG.

ai guide

Chunking Strategies: How You Split Text Determines Whether RAG Can Find the Answer

Chunks too large and retrieval loses precision; too small and you lose context. Chunking is the most underrated part of RAG — pick the wrong strategy and no amount of downstream optimization will save you.

ai guide

ColBERT: The Third Way in Vector Search

Bi-Encoders are too coarse, Cross-Encoders are too slow — ColBERT's Late Interaction finds the sweet spot: token-level comparison between query and document, but with document vectors that can be precomputed.

ai guide

Contextual Retrieval: Giving Every Chunk Its "What This Is About" Context

When you split a document into chunks, each chunk loses its place in the original document. Contextual Retrieval solves the isolated-chunk problem by injecting a document-level summary into every chunk at index time.

ai guide

CRAG: Automatically Relaxing Filters When Retrieval Comes Up Empty

Filters too strict and getting zero results? CRAG automatically relaxes them and retries — far better than letting the LLM hallucinate an answer from general knowledge.
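
The retry-with-relaxed-filters behaviour described here could look roughly like the sketch below. The toy `search` function, the `relax_order` parameter, and the route records are assumptions made for illustration, not the post's implementation.

```python
def search(docs, filters):
    """Toy filtered search: keep documents matching every filter exactly."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]

def search_with_relaxation(docs, filters, relax_order):
    """Retry with progressively fewer filters until something comes back.

    relax_order lists filter keys from least to most important; the least
    important filter is dropped first. Returns (results, filters_used).
    """
    active = dict(filters)
    results = search(docs, active)
    for key in relax_order:
        if results:
            break
        if key in active:
            del active[key]
            results = search(docs, active)
    return results, active

# Hypothetical route records.
docs = [
    {"name": "Route A", "grade": "5.10a", "area": "north", "style": "sport"},
    {"name": "Route B", "grade": "5.11b", "area": "south", "style": "trad"},
]
results, used = search_with_relaxation(
    docs,
    {"grade": "5.11b", "area": "north", "style": "sport"},
    relax_order=["style", "area"],
)
```

Returning the filters that were actually used lets the answer say which constraints were dropped instead of silently ignoring them.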

ai guide

Cross-Encoder Reranking: Surfacing the Most Relevant Documents

Vector search similarity scores don't equal relevance. Cross-Encoders score each query-document pair jointly, then reorder the results to push the truly relevant documents to the top.

ai guide

GraphRAG: Structuring Knowledge as a Graph for Relationship-Based Reasoning

Vector search finds similarity; graph search traverses relationships. When a question requires reasoning across multiple entities — crag → route → sender → grade distribution — GraphRAG outperforms standard RAG.

ai guide RAG Systems in Practice

Hybrid Search: Using BM25 + Vector Search to Cover Each Other's Blind Spots

Vector search handles semantics; BM25 handles keywords. Combining them with RRF is what lets you handle both fuzzy queries and exact terms at the same time.

ai guide

HyDE: Boosting Vector Search Recall with Hypothetical Answers

Have an LLM generate an 'ideal answer' first, then embed that hypothetical document for search — it outperforms searching with the raw query.

ai guide

RAG Personalization: Learning User Preferences from Conversations

After each conversation, asynchronously extract likely user preferences and skill level, then automatically personalize search parameters on the next query — no manual setup required.

ai guide

MMR + Popularity Weighting: Recommendations That Are Both Relevant and Diverse

Ranking purely by relevance leaves you with five documents all describing the same route. MMR strikes a balance between relevance and diversity, and layering in popularity weighting makes results even more useful.
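
As a sketch of how the two signals might combine, the greedy loop below applies standard MMR and adds a small popularity bonus. The additive form of the bonus, the `pop_weight` value, and the toy vectors are assumptions; the post may weight popularity differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidates, k=3, lam=0.7, pop_weight=0.1):
    """Greedy MMR with a small popularity bonus.

    candidates: list of dicts with 'id', 'vec', and 'popularity' in [0, 1].
    lam trades relevance (1.0) against diversity (0.0); pop_weight is an
    illustrative additive bonus, not a value from the post.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(c):
            relevance = cosine(query_vec, c["vec"])
            # Redundancy is the similarity to the closest already-picked item.
            redundancy = max(
                (cosine(c["vec"], s["vec"]) for s in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy + pop_weight * c["popularity"]
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [c["id"] for c in selected]

candidates = [
    {"id": "a", "vec": [1.0, 0.0], "popularity": 0.9},
    {"id": "a_dup", "vec": [1.0, 0.0], "popularity": 0.2},
    {"id": "b", "vec": [0.0, 1.0], "popularity": 0.5},
]
picked = mmr_select([0.8, 0.6], candidates, k=2)
relevance_only = mmr_select([0.8, 0.6], candidates, k=2, lam=1.0, pop_weight=0.0)
```

In this toy data, relevance-only ranking returns the near-duplicate pair, while the MMR variant swaps the duplicate for the less similar but still relevant item.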

ai deep-dive

Modular RAG Pipeline: Designing RAG as a Composable DAG

RAG doesn't have to be a rigid three-step process. It's a set of steps that can be dynamically enabled, skipped, or reordered. Pipeline as Code lets the system adapt its behavior without redeployment.

ai guide

Multi-Query Expansion: Search One Question from Multiple Angles

A single vector search on a complex query often misses relevant documents. Let the LLM rewrite the query into 3-5 sub-queries, run them in parallel, and recall improves significantly.

ai guide

Multimodal RAG: Bringing Images into the Knowledge Base

Climbing routes carry a ton of visual information (topos, wall photos) that text-only RAG misses entirely. Multimodal RAG makes images searchable and understandable.

ai deep-dive

Three Generations of RAG: From Naive to Modular

Naive RAG works but has real problems. Advanced RAG patches those problems. Modular RAG rearchitects the whole system to be composable and configurable. Understanding all three generations is the key to understanding why modern RAG systems look the way they do.

ai guide

Plan-and-Execute: A RAG Pattern That Plans Before It Acts

For complex queries, have the LLM map out what information is needed and in how many steps — then execute that plan. More systematic than thinking on the fly.

ai guide

Query Classification: Teaching Your RAG System How to Answer Each Question

Not every question needs full RAG. Classify queries with an LLM first, then route to the right execution path — saving cost and improving accuracy.

ai guide

RAG A/B Testing: A Scientific Approach to Comparing Pipeline Configurations

"Adding a Cross-Encoder feels better" is not a scientific evaluation. A/B testing tells you whether a change actually works, how much it helps, and which query types benefit.

ai guide

RAG Cold Start: Building a Useful System When You Have No Data

A RAG system needs data to answer questions, but data only accumulates as the system gets used. Cold-start strategy is what bridges the gap from empty to useful.

ai guide

RAG Cost Optimization: Minimizing the Cost of Every Query

RAG system costs come from LLM tokens, Embedding APIs, and vector search. Every stage has room for cost reduction, but you need to verify that optimizations don't sacrifice too much quality.

ai guide

RAG Evaluation Frameworks: How to Use RAGAS, DeepEval, and TruLens

RAG system quality is hard to evaluate by intuition alone. RAGAS, DeepEval, and TruLens provide systematic metric frameworks that pinpoint exactly which component is failing.

ai debug RAG Systems in Practice

RAG Common Failure Modes: 10 Problems and Their Solutions

When a RAG system breaks, 90% of the time it's one of these 10 failure modes. Identify which one first, then apply the matching fix — far more effective than optimizing blindly.

ai guide

RAG Guardrails: Adding a Defense Layer to Inputs and Outputs

The attacks RAG systems face go beyond the technical level — Prompt Injection and Jailbreak are real threats. Both inputs and outputs need independent protection layers.

ai guide

RAG Observability Tool Landscape: Choices in 2026

Rolling your own traces is good enough, but open-source tools save you a lot of work. Langfuse, Phoenix, and LangSmith each have their niche — the right choice depends on your trade-offs around self-hosting, open source, and integration complexity.

ai guide

RAG Observability: 17-Step Tracing to Turn the Black Box Transparent

The hardest part of a RAG system isn't building it — it's figuring out why a particular answer went wrong. Pipeline Tracing records every step's decisions and data so debugging has a clear trail to follow.

ai guide

RAG Prompt Engineering: How to Design System Prompts and Context

Search found the right documents, but the LLM's answers are still poor — often the problem lies in prompt design. System prompt structure, context formatting, and instruction placement all affect output quality.

ai guide

RAG Streaming: Using SSE to Display LLM Responses as They Generate

LLM generation takes 3-5 seconds, and waiting for the full response before displaying it makes for a terrible experience. SSE pushes tokens as they're generated, reducing time-to-first-character from 5 seconds to under 1 second.

ai guide

RAG Quota System: Controlling LLM Costs with Dual Limits

Limiting request count alone is not enough — a single long query can consume ten times the tokens of a normal one. Dual quotas (request count + token count) are what truly control costs.
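
A minimal sketch of a dual-limit check, assuming token usage is estimated before the call and charged after it. The `DualQuota` class and its limits are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DualQuota:
    """Per-user budget with two independent ceilings (hypothetical limits)."""
    max_requests: int
    max_tokens: int
    used_requests: int = 0
    used_tokens: int = 0

    def allows(self, estimated_tokens: int) -> bool:
        """A request passes only if both the count and token budgets have room."""
        return (
            self.used_requests + 1 <= self.max_requests
            and self.used_tokens + estimated_tokens <= self.max_tokens
        )

    def record(self, actual_tokens: int) -> None:
        """Charge the quota after the LLM call, using the real token count."""
        self.used_requests += 1
        self.used_tokens += actual_tokens

quota = DualQuota(max_requests=10, max_tokens=5_000)
quota.record(4_500)                 # one unusually long query
can_send_small = quota.allows(400)  # fits: both budgets still have room
can_send_large = quota.allows(800)  # blocked by tokens despite 9 requests left
```

A request-count limit alone would have let the second large query through, which is the gap the token ceiling closes.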

ai deep-dive

RAG vs Fine-tuning: It's Not Either/Or

RAG and Fine-tuning solve different problems. RAG gives the model new knowledge; Fine-tuning changes the model's behavior and style. In most cases you use both, not pick one.

ai guide

RRF: How to Merge Multi-Source Results in RAG Systems

BM25, vector search, HyDE, and Multi-Query each produce separate result sets -- how do you merge them sensibly? RRF uses ranks instead of scores, sidestepping the fundamental problem that scores from different systems are incomparable.
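
For reference, a small sketch of Reciprocal Rank Fusion over ranked lists of document IDs. It uses k=60, the value mentioned elsewhere on this page; the document IDs and the tie-breaking rule are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because only ranks enter the formula, a BM25 score of 12.4 and a cosine score of 0.83 never have to be put on the same scale.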

ai guide

Self-Reflection + LLM-as-Judge: Having AI Evaluate Its Own Answers

Use another LLM to evaluate answer accuracy and quality — if the score is too low, regenerate, and automatically add appropriate disclaimers.

ai guide

Semantic Caching: Run the RAG Pipeline Only Once for Semantically Similar Queries

Caching doesn't have to match exact query strings -- semantically similar questions can hit the cache too, skipping the entire RAG pipeline execution.
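
A toy sketch of a semantic cache, assuming query embeddings are computed elsewhere. The `SemanticCache` class, the 0.95 threshold, and the three-dimensional vectors are illustrative assumptions; a production cache would use a vector index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Linear-scan cache keyed by query embeddings (fine for a sketch)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # illustrative value; tune on real traffic
        self.entries = []           # list of (embedding, answer)

    def get(self, embedding):
        """Return the cached answer of the closest entry above the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding, answer):
        """Store an answer under the embedding of the query that produced it."""
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "cached answer")
hit = cache.get([0.88, 0.12, 0.01])  # near-duplicate query embedding
miss = cache.get([0.0, 0.2, 0.9])    # unrelated query embedding
```

The threshold is the main design choice: set it too low and different questions share an answer, too high and the cache rarely hits.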

ai guide

SPLADE: Smarter Sparse Vector Search Beyond BM25

BM25 only recognizes words that appear in the query. SPLADE infers related terms and adds them to the search, gaining partial semantic capability while preserving the precision of keyword search.

ai guide

Text-to-SQL Router: Precise Queries That Skip RAG

Questions like 'how many routes did I complete this year' will never be answered well by RAG semantic search — querying the database directly is far more accurate. Let the LLM identify intent, extract parameters, and execute predefined SQL templates.

ai guide

Vector Database Selection: How to Choose Between Pinecone, Weaviate, Qdrant, and Vectorize

Vector database selection is constrained by your deployment platform more tightly than LLM selection is. Determine your platform and scale requirements first, then evaluate features; don't just look at benchmarks.

product project

Why Does a Climbing Community Need AI? NobodyClimb's Experiment and What We Learned

NobodyClimb uses RAG to tackle scattered climbing route information, ties quota limits to community engagement, and leverages Cloudflare Workers AI to bring inference costs close to zero.

tech deep-dive

NobodyClimb: Building a Climbing Community Platform Entirely on Cloudflare

A climbing community platform where the web app, mobile app, and AI Q&A all run on Cloudflare — no dedicated servers.

tech deep-dive

NobodyClimb AI Architecture: Building a 20-Node RAG Pipeline on Cloudflare Workers

A dynamically composable RAG pipeline built on Cloudflare Workers AI (gemma-3-12b-it + bge-m3): 14 base steps + 6 LangGraph-specific nodes, with three strategy graphs (Baseline / Agentic / Plan-Execute) selected at runtime.