
#llm-judge

2 articles

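ai deep-dive June 4, 2026

How to Rigorously Compare an Agent Before and After Tuning: From Golden Sets to Statistical Tests

Even at temperature=0, LLM output can still fluctuate by 15% in measured runs. Comparing an agent rigorously before and after a change takes a frozen golden set, at least 3 runs per question with the scores averaged, blind LLM-as-judge evaluation (where pairwise preference flip rates reach as high as 35%), and paired statistical tests, rather than asking each version once and going by feel.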

#evaluation #rag #llm-judge #ab-testing #ai-agent #llm

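ai guide March 12, 2026

Self-Reflection + LLM-as-Judge: Letting AI Evaluate Its Own Answers

Use a second LLM to evaluate an answer's accuracy and quality, regenerate the answer when the score is too low, and automatically add an appropriate disclaimer.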

#rag #llm-judge #self-reflection #groundedness #quality-assurance