Framework for building and scaling LLM-as-a-judge evaluation systems
Top 97.5% on sourcepulse
Verdict is a declarative Python framework for building and scaling LLM-as-a-judge systems. It addresses the unreliability of current LLM judges by letting users compose complex, multi-step evaluation protocols, with a plug-and-play approach for rapid iteration across models, prompts, and aggregation strategies. This enables more robust and efficient evaluation of AI applications, guardrails, and reinforcement learning tasks.
How It Works
Verdict uses a composable architecture of Unit, Layer, and Block primitives to construct sophisticated judge protocols. These primitives can be chained and repeated to create hierarchical reasoning, debate, and aggregation patterns. By composing multiple LLM calls into a single judge, Verdict scales judge-time compute, drawing on research in scalable oversight and automated evaluation, and aims for state-of-the-art performance at lower latency and cost than monolithic reasoning models.
Quick Start & Requirements
pip install verdict
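Judges call out to an LLM provider, so you will typically need a provider API key in your environment (for example OPENAI_API_KEY when using OpenAI-hosted models; the exact configuration depends on your provider). A minimal first run might look like the sketch below; as above, the identifiers are assumptions and the actual API may differ.

```python
# Minimal first run: a single judge Unit in a one-step Pipeline.
# Identifiers are assumed from the project description, not verified
# against the released package.
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.scale import DiscreteScale

pipeline = Pipeline() >> JudgeUnit(DiscreteScale((1, 5))).prompt(
    "Rate the factual accuracy of the following response from 1 (wrong) "
    "to 5 (fully correct): {source.response}"
)

result = pipeline.run({"response": "The Great Wall of China is visible from space."})
print(result)
```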
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework is designed for LLM-as-a-judge tasks and may require custom primitives for evaluation scenarios outside this scope. While it aims to mitigate LLM judge unreliability, the effectiveness of composed judges still depends on the underlying LLM capabilities and prompt engineering.