August 2025

Judging judges: Building trustworthy LLM evaluations

TL;DR: LLM-as-a-Judge systems can be fooled by confident-sounding but wrong answers, giving teams false confidence in their models. We built a human-labeled dataset and used our open-source framework syftr to systematically…