🌏 中文版
The observability requirements for RAG systems are clear: trace the execution of every query, log LLM inputs and outputs, evaluate answer quality, and identify which step problems concentrate in.
You can build it yourself (the previous post covered pipeline trace design), or use existing tools. The upside of tools is out-of-the-box UI, built-in evaluation features, and team collaboration support; the cost is one more external dependency.
The mainstream choices in 2026:
Langfuse
Positioning: An open-source Observability platform for LLM applications — the most popular self-hosted option.
Core Features:
- Trace view: complete LLM call trees (inputs, outputs, latency, token counts)
- Session management: link multi-turn conversations into a single session
- Evaluation framework: custom scorers with LLM-as-Judge integration
- Dataset management: collect real queries for regression testing
- Prompt management: versioned prompts with tracking of which prompt version performs best
SDK Integration:
import Langfuse from "langfuse";
const langfuse = new Langfuse({
secretKey: process.env.LANGFUSE_SECRET_KEY,
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
baseUrl: "https://cloud.langfuse.com", // or self-hosted
});
// Record a trace in the RAG pipeline
const trace = langfuse.trace({
name: "rag-query",
input: { query },
userId: userId,
});
const retrievalSpan = trace.span({
name: "hybrid-search",
input: { filter, topK },
});
// ... search execution ...
retrievalSpan.end({
output: { candidateCount: results.length },
metadata: { cragTriggered: false },
});
const generationSpan = trace.span({
name: "llm-generation",
input: { messages },
});
// ... generation execution ...
generationSpan.end({
output: { response: answer },
usage: { promptTokens, completionTokens },
});
trace.update({ output: { answer, sources } });
await langfuse.flushAsync();
Strengths:
- Open source and self-hostable (EU data compliance)
- Comprehensive evaluation features (human annotation + LLM judge)
- Prompt version management is the most complete among similar tools
Weaknesses:
- Self-hosting has maintenance overhead (requires PostgreSQL + Redis)
- Dashboard customization flexibility is limited
Best for: Teams that need data to stay on their own infrastructure and value prompt version management.
Phoenix (Arize AI)
Positioning: Open-source AI Observability with a strong emphasis on evaluation and dataset curation.
Core Features:
- Trace view (similar to Langfuse)
- Built-in RAG evaluation metrics: Hallucination, QA Correctness, Relevance
- Embedding visualization: project embeddings onto 2D with UMAP to inspect cluster structure
- Experiment framework: A/B comparison of different pipeline configurations
Most Unique Feature: Embedding Visualization
import phoenix as px
# Project query embeddings onto 2D to visualize query distribution
px.launch_app(trace_dataset)
You can see which queries cluster together in vector space and which are isolated (possibly due to poor embedding quality or the database lacking relevant content). This visualization is very helpful for discovering systematic blind spots in RAG systems.
Strengths:
- Fully open source (Apache 2.0)
- Embedding visualization is a unique selling point
- Best integration with LlamaIndex and LangChain
Weaknesses:
- Primarily a Python ecosystem — TypeScript SDK has fewer features
- Weaker prompt management compared to Langfuse
Best for: Python tech stacks and scenarios requiring deep embedding quality analysis.
LangSmith
Positioning: LangChain’s official Observability platform, deeply integrated with LangChain / LangGraph.
Core Features:
- Automatic tracing (nearly zero configuration when using LangChain)
- Playground: debug prompts directly in the UI
- Annotation Queue: human labeling queue suitable for small teams doing human eval
- Dataset + Evaluation: systematic regression testing framework
Integration:
// If using LangChain, just set environment variables
process.env.LANGCHAIN_TRACING_V2 = "true";
process.env.LANGCHAIN_API_KEY = "...";
// All LangChain calls are automatically traced — no additional code needed
const chain = new RetrievalQAChain({ ... });
await chain.call({ query });
Strengths:
- Simplest setup when using LangChain
- Complete dataset management and evaluation framework
- Playground is convenient for prompt engineering
Weaknesses:
- Closed source — data lives on LangChain’s servers
- Integration complexity increases if you’re not using LangChain
- Relatively expensive (enterprise tier)
Best for: Teams on the LangChain tech stack that need a comprehensive evaluation framework.
Helicone
Positioning: Proxy-layer observability for LLM APIs — the most lightweight option.
Core Features:
- Acts as a proxy for LLM APIs, automatically capturing all calls
- Cost tracking (by model, user, time period)
- Rate limiting and caching (at the proxy layer)
- Request replay
Integration:
// Just change the baseURL — no other code changes needed
const openai = new OpenAI({
baseURL: "https://oai.helicone.ai/v1",
defaultHeaders: {
"Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
},
});
Strengths:
- Lowest integration cost (change one baseURL line)
- Most detailed cost analysis among similar tools
- Supports OpenAI, Anthropic, Gemini, and self-hosted models
Weaknesses:
- Cannot see RAG-level traces (only LLM calls — no visibility into preceding search steps)
- Basic evaluation features
- Data lives on Helicone’s servers
Best for: Scenarios that only need LLM cost monitoring and basic usage metrics without deep RAG tracing.
Comparison Summary
| Langfuse | Phoenix | LangSmith | Helicone | |
|---|---|---|---|---|
| Open Source | ✅ | ✅ | ❌ | ❌ |
| Self-Hosted | ✅ | ✅ | ❌ | ❌ |
| RAG Trace Depth | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐ |
| Embedding Visualization | ❌ | ✅ | ❌ | ❌ |
| Prompt Management | ⭐⭐⭐ | ⭐ | ⭐⭐ | ❌ |
| Evaluation Framework | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐ |
| TypeScript SDK | ✅ | 🟡 | ✅ | ✅ |
| Integration Complexity | Medium | Medium | Low (LangChain) / High (other) | Lowest |
How to Choose
Self-hosted + full features → Langfuse. Currently the most mature open-source option with a comprehensive evaluation framework; prompt version management is a bonus.
Need embedding visualization → Phoenix. Embedding cluster analysis is a unique capability no other tool offers.
On the LangChain tech stack → LangSmith. Zero setup cost, and Playground makes prompt iteration convenient.
Only need cost monitoring, don’t want to change code → Helicone. Change one baseURL line and you immediately get cost reports.
Roll your own traces → Best for scenarios with special requirements or where you want full control over trace data structures. The cost is maintaining your own UI and query interface, but you get complete customization.
NobodyClimb’s system went with custom traces, mainly because it’s deployed on Cloudflare Workers (which can’t easily run external SDKs’ flush mechanisms), and trace data needs tight integration with business data (climbing routes, user profiles). But if starting from scratch with no platform constraints, Langfuse would be the first option to try.
References
- Langfuse Documentation
- Langfuse GitHub Repository
- Phoenix (Arize AI) Documentation
- Phoenix GitHub Repository
- LangSmith Documentation
- Helicone Documentation
- OpenLLMetry - OpenTelemetry for LLMs (GitHub)
- NobodyClimb System Architecture: Cloudflare Full-Stack Climbing Community Platform
- NobodyClimb AI Architecture: 20-Node RAG Pipeline
Loading...