🌏 中文版
After a RAG system goes live, a user reports “this answer is wrong” — how do you investigate?
Without observability, all you can do is guess: Did vector search retrieve the wrong documents? Was it an LLM hallucination? Were the filter conditions too strict, resulting in zero results? Did the reranker push good documents to the bottom? Any step could be the problem, but you have no way to tell which one.
RAG Tracing solves this: record every pipeline step’s inputs, outputs, and decisions so that when something goes wrong, you can reconstruct the execution process step by step.
Trace Data Structure
interface PipelineTrace {
queryId: string;
totalDurationMs: number;
cache: {
hit: boolean;
similarity?: number; // Similarity score on cache hit
};
toolSelection: {
queryType: string;
reasoning: string; // LLM's reasoning for classification
sqlTemplateId?: string;
usedFallback: boolean; // Whether regex fallback was used
durationMs: number;
};
hyde: {
skipped: boolean;
generatedDoc?: string; // Generated hypothetical document (for debugging)
durationMs?: number;
};
multiQuery: {
skipped: boolean;
subQueries?: string[]; // List of generated sub-queries
durationMs?: number;
};
filterBuild: {
extractedFilters: VectorFilter;
usedNlp: boolean;
nlpMethod?: string; // LLM / regex / none
};
retrieval: {
vectorCandidates: number; // Number of vector search hits
bm25Candidates: number; // Number of BM25 hits
rrfMerged: number; // Count after RRF fusion
cragTriggered: boolean; // Whether relaxed retry was triggered
relaxedFilter?: VectorFilter;
durationMs: number;
};
crossEncoder: {
skipped: boolean;
inputCount?: number;
outputCount?: number; // Count remaining after threshold filtering
threshold?: number;
durationMs?: number;
};
mmr: {
lambda: number;
inputCount: number;
outputCount: number;
durationMs: number;
};
generation: {
model: string; // Which LLM was used
promptTokens: number;
completionTokens: number;
injectedDocuments: string[]; // IDs of documents injected into the prompt
durationMs: number;
};
judge: {
groundedness: number;
quality: number;
reasoning: string; // Judge's scoring rationale
durationMs: number;
};
selfReflection: {
triggered: boolean;
accepted?: boolean; // Whether the regenerated answer was adopted
originalGroundedness?: number;
regenGroundedness?: number;
};
}
What to Record and Why
The key is to record not just “what happened” but also “why it happened that way”:
toolSelection.reasoning: The rationale the LLM provided when classifying the query (“This query contains counting intent, suitable for SQL template”). The classification result alone isn’t enough — the reasoning helps you judge whether the classification was sound.
toolSelection.usedFallback: If the LLM timed out and fell back to regex classification, this will be true. Low-quality query classifications are often caused by regex fallback.
retrieval.cragTriggered: Whether filters were relaxed and retrieval retried due to zero results. This signal indicates the query’s filter conditions may be too strict, or the database lacks content of this type.
generation.injectedDocuments: Which documents the LLM actually saw. If the answer is wrong, you can cross-reference these documents to determine whether the documents themselves were flawed or the LLM misinterpreted their content.
selfReflection.accepted: Whether the regenerated answer was adopted. If it was adopted but groundedness is still low, the problem lies in insufficient context, not the generation strategy.
Time Breakdown
interface TokenBreakdown {
embeddingMs: number; // Embedding computation time
retrievalMs: number; // Search + RRF time
rerankingMs: number; // Cross-encoder time
generationMs: number; // LLM generation time
judgeMs: number; // Judge scoring time
overheadMs: number; // Other (routing, DB writes, etc.)
totalMs: number;
}
The time breakdown gives optimization a direction: if most time is spent on generationMs, switch to a lighter model or shorten the context; if it’s rerankingMs, consider reducing the candidate count; if it’s embeddingMs, check whether parallelization has issues.
Storage Design
Traces are stored as JSON in the pipeline_trace table:
CREATE TABLE pipeline_trace (
id TEXT PRIMARY KEY,
query_log_id TEXT NOT NULL, -- Links to ai_query_logs
trace_data TEXT NOT NULL, -- Full trace in JSON format
created_at INTEGER NOT NULL
);
The reason for choosing JSON over normalized tables: the trace structure evolves as the pipeline changes, and JSON doesn’t require migrations. The admin dashboard parses JSON fields directly when querying.
When query volume is high, only the last 30 days of traces are retained, with older ones cleaned up periodically:
DELETE FROM pipeline_trace
WHERE created_at < unixepoch() - 30 * 86400;
Admin Dashboard Trace View
The admin AI Log page renders trace data as a timeline view:
Query: What 5.11 routes are at Longdong? Total: 6.2s
[Cache] Miss 0ms
[Query Classify] complex (confidence 0.92) 380ms
[HyDE] Generated hypothetical doc (89 chars) 820ms
[Multi-Query] Generated 3 sub-queries 610ms
[Filter Build] crag_id=longtung, grade>=110 45ms
[Hybrid Search] Vector 18 + BM25 12 → RRF 22 340ms
[Cross-Encoder] 22 → 8 (threshold 0.5) 290ms
[MMR] 8 → 5 (lambda=0.7) 12ms
[LLM Generation] Gemma-3-12b, 1240 tokens 3,840ms
[Judge] groundedness 0.87, quality 3 510ms
[Self-Reflection] Not triggered (quality > 2) 0ms
[Output Guard] Passed 8ms
Each step can be expanded to show detailed data (generated hypothetical document content, sub-query list, injected document list).
Using Traces to Pinpoint Issues
Problem: Irrelevant answer
- Check
toolSelection.queryType: Was the classification correct? - Check
filterBuild.extractedFilters: Were the filter conditions correct? - Check
retrieval.vectorCandidates+bm25Candidates: Were any results found? - Check
generation.injectedDocuments: Were the documents the LLM saw actually the right ones?
Problem: Hallucinated answer
- Check
judge.groundedness: Is it below 0.6? - Check
generation.injectedDocuments: Do these documents contain relevant content? - Check
retrieval.cragTriggered: Did CRAG relax filters, bringing in irrelevant documents?
Problem: Slow answer
- Check
tokenBreakdown: Which step took the most time? - Check
judge.durationMs: Is the Judge slowing down the critical path? (Judge should be async) - Check
generation.promptTokens: Is the context too long?
The Big Picture
Observability is the critical gap between a RAG system that “runs” and one that’s “operable.” Without traces, every issue requires guesswork; with traces, problems can almost always be pinpointed to a specific step.
When designing traces, recording “decision rationale” is more valuable than recording “execution results.” Results can often be inferred from the answer itself — rationale is the key to debugging. Why did the LLM classify the query this way? Why did CRAG trigger? Why didn’t Self-Reflection adopt the new answer? The answers to all these questions live in the trace.
References
Loading...