#evaluation

4 posts

ai deep-dive Jun 4, 2026

How to Rigorously Compare Before and After Agent Changes: From Golden Sets to Statistical Testing

Even with temperature=0, LLM outputs can still fluctuate by up to 15% in practice. To compare agent changes rigorously, you need a frozen golden set, at least 3 runs per query with the scores averaged, blind LLM-as-judge evaluation (pairwise preference flip rates can reach 35%), and paired statistical tests, rather than running each version once and going by feel.

#evaluation #rag #llm-judge #ab-testing #ai-agent #llm

ai guide Mar 26, 2026

Langfuse Complete Guide: LLM Application Observability from Scratch

Langfuse is currently the most mature open-source LLM observability platform. This post covers its four core capabilities (Tracing, Prompt Management, Evaluation, and Datasets) and shows how to use them in real projects.

#langfuse #observability #tracing #llm #prompt-management #evaluation #monitoring

ai guide RAG Systems in Practice Mar 14, 2026

The Complete Guide to RAG System Patterns: A Ten-Generation Evolution from Naive to Multi-Agent with Practical Navigation

RAG has evolved far beyond simple 'search + generate' into a technology ecosystem spanning ten generations, from Naive RAG to Multi-Agent RAG. This article is a systematic navigation guide to that ecosystem, covering retrieval strategies, chunking, embedding, reranking, evaluation frameworks, observability, and cost optimization. Each topic has a dedicated deep-dive article.

#rag #guide #retrieval #embedding #reranking #evaluation #agent

ai guide Mar 12, 2026

RAG Evaluation Frameworks: How to Use RAGAS, DeepEval, and TruLens

RAG system quality is hard to evaluate by intuition alone. RAGAS, DeepEval, and TruLens provide systematic metric frameworks that pinpoint exactly which component is failing.

#rag #evaluation #ragas #deepeval #trulens #metrics #quality