#ab-testing — quidproquo

ai deep-dive Jun 4, 2026

How to Rigorously Compare Before and After Agent Changes: From Golden Sets to Statistical Testing

Even with temperature=0, LLM outputs can still fluctuate by up to 15% in practice. To rigorously compare agent changes, you need a frozen golden set, at least 3 runs per query averaged out, LLM-as-judge blind evaluation (pairwise preference flip rate reaches 35%), and paired statistical tests -- not just running each version once and going by feel.

#evaluation #rag #llm-judge #ab-testing #ai-agent #llm

ai guide Mar 12, 2026

RAG A/B Testing: A Scientific Approach to Comparing Pipeline Configurations

"Adding a Cross-Encoder feels better" is not a scientific evaluation. A/B testing tells you whether a change actually works, how much it helps, and which query types benefit.

#rag #ab-testing #experimentation #metrics #pipeline