#llm-judge — quidproquo

ai deep-dive Jun 4, 2026

How to Rigorously Compare Before and After Agent Changes: From Golden Sets to Statistical Testing

Even with temperature=0, LLM outputs can still fluctuate by up to 15% in practice. To compare agent changes rigorously, you need a frozen golden set, at least 3 runs per query with scores averaged, blind LLM-as-judge evaluation (pairwise preference flip rates can reach 35%), and paired statistical tests, rather than running each version once and going by feel.

#evaluation #rag #llm-judge #ab-testing #ai-agent #llm
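
As a rough illustration of the last two ingredients in the summary above (averaging several runs per query, then running a paired test), here is a minimal sketch using only the Python standard library. It assumes each run has already been scored by a judge on a numeric scale; the function names, the sign-flip permutation test (one possible paired test, not necessarily the one the article uses), and the demo scores are all illustrative assumptions.

```python
import random
from statistics import mean


def per_query_means(runs):
    """Average repeated runs: {query_id: [score, ...]} -> {query_id: mean}."""
    return {query: mean(scores) for query, scores in runs.items()}


def paired_permutation_test(before, after, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-query differences.

    `before` and `after` map query_id -> averaged score for each agent
    version. Only queries present in both are compared (the frozen set).
    Returns (mean difference, approximate p-value).
    """
    queries = sorted(before.keys() & after.keys())
    diffs = [after[q] - before[q] for q in queries]
    observed = mean(diffs)

    rng = random.Random(seed)  # fixed seed so the test itself is repeatable
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Made-up judge scores: 3 runs per query for each agent version.
    old_runs = {"q1": [0.70, 0.65, 0.72], "q2": [0.50, 0.55, 0.52]}
    new_runs = {"q1": [0.80, 0.78, 0.82], "q2": [0.51, 0.49, 0.56]}
    delta, p = paired_permutation_test(
        per_query_means(old_runs), per_query_means(new_runs)
    )
    print(f"mean improvement={delta:.3f}, p={p:.3f}")
```

Pairing by query matters because query difficulty varies far more than the effect of most agent changes; comparing the two versions on the same queries removes that variance from the test.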

ai guide Mar 12, 2026

Self-Reflection + LLM-as-Judge: Having AI Evaluate Its Own Answers

Use a second LLM to score each answer for accuracy and quality. If the score is too low, regenerate the answer and automatically add appropriate disclaimers.

#rag #llm-judge #self-reflection #groundedness #quality-assurance
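
The regenerate-or-disclaim loop described in the summary above might look roughly like the sketch below. The `generate` and `judge` callables are hypothetical stand-ins for your own model calls, and the threshold, attempt limit, and disclaimer wording are assumptions chosen for illustration, not values taken from the guide.

```python
from typing import Callable

# Hypothetical disclaimer text; adapt the wording to your product.
DISCLAIMER = "Note: this answer could not be fully verified against the sources."


def answer_with_reflection(
    question: str,
    generate: Callable[[str], str],      # hypothetical: question -> answer
    judge: Callable[[str, str], float],  # hypothetical: (question, answer) -> score in [0, 1]
    threshold: float = 0.7,              # assumed cutoff for "good enough"
    max_attempts: int = 3,               # assumed retry budget
) -> str:
    """Regenerate until the judge score passes, else return the best try with a disclaimer."""
    best_answer, best_score = "", float("-inf")
    for _ in range(max_attempts):
        answer = generate(question)
        score = judge(question, answer)
        if score > best_score:
            best_answer, best_score = answer, score
        if score >= threshold:
            return answer  # passed the judge, no disclaimer needed
    # Every attempt scored too low: keep the best one but flag it.
    return f"{best_answer}\n\n{DISCLAIMER}"
```

Capping the number of attempts keeps latency and cost bounded; returning the best-scoring attempt with a disclaimer is one reasonable fallback, though refusing to answer is another.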