verdict by haizelabs

Framework for LLM-as-a-judge systems, scaling evaluation

created 7 months ago
264 stars

Top 97.5% on sourcepulse

Project Summary

Verdict is a declarative Python framework for building and scaling LLM-as-a-judge systems. It targets the unreliability of current LLM judges by letting users compose multi-step evaluation protocols, and its plug-and-play design supports rapid iteration across models, prompts, and aggregation strategies. The stated goal is more robust and efficient evaluation of AI applications, guardrails, and reinforcement learning tasks.

How It Works

Verdict builds judge protocols from three composable primitives: Unit, Layer, and Block. These can be chained and repeated to express hierarchical reasoning, debate, and aggregation patterns. Combining multiple LLM calls in this way scales judge-time compute, an approach grounded in research on scalable oversight and automated evaluation. The project claims state-of-the-art performance at lower latency and cost than monolithic reasoning models.
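To make the composition idea concrete, here is a minimal sketch of how Unit, Layer, and Block style primitives could fit together. It does not use Verdict's actual API: the class names mirror the concepts above, but their fields and methods, the `MajorityVote` aggregator, and the three stub judges are all hypothetical. The stubs are simple rules standing in for real LLM calls so the example runs offline.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for illustration only, NOT Verdict's real classes.
Judge = Callable[[str], str]  # maps an input text to a label


@dataclass
class Unit:
    """A single judge call."""
    judge: Judge

    def run(self, text: str) -> List[str]:
        return [self.judge(text)]


@dataclass
class Layer:
    """Runs several units on the same input and collects their labels."""
    units: List[Unit]

    def run(self, text: str) -> List[str]:
        return [label for unit in self.units for label in unit.run(text)]


class MajorityVote:
    """Aggregation step: reduces many labels to the most common one."""

    def reduce(self, labels: List[str]) -> str:
        return Counter(labels).most_common(1)[0][0]


@dataclass
class Block:
    """Chains a layer of judges into an aggregator."""
    layer: Layer
    aggregator: MajorityVote

    def run(self, text: str) -> str:
        return self.aggregator.reduce(self.layer.run(text))


# Stub judges: simple rules in place of LLM calls (assumption for the demo).
def keyword_judge(text: str) -> str:
    return "unsafe" if "attack" in text.lower() else "safe"


def length_judge(text: str) -> str:
    return "unsafe" if len(text) > 200 else "safe"


def strict_judge(text: str) -> str:
    flagged = ("attack", "exploit")
    return "unsafe" if any(word in text.lower() for word in flagged) else "safe"


block = Block(
    layer=Layer([Unit(keyword_judge), Unit(length_judge), Unit(strict_judge)]),
    aggregator=MajorityVote(),
)

if __name__ == "__main__":
    print(block.run("How do I attack this server?"))
    print(block.run("Hello there"))
```

Using an odd number of judges over two labels avoids ties in the vote. A real protocol would replace the stubs with model calls and could nest blocks to form the hierarchical patterns described above.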

Quick Start & Requirements

Highlighted Details

  • Achieves state-of-the-art or near state-of-the-art performance on benchmarks for content moderation, hallucination detection, and fact-checking.
  • Enables hierarchical reasoning and debate-aggregation patterns for complex evaluations.
  • Integrates with DSPy for use as a metric in AI system optimization.
  • Client-side rate limiting helps prevent experiments from being lost to provider rate-limit errors.
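The summary does not say how Verdict implements client-side rate limiting. As one common way to do it, the sketch below uses a token bucket that makes the caller wait before sending a request, rather than letting the provider reject it. The `RateLimiter` class, its parameters, and the injectable `clock` and `sleep` hooks are assumptions for illustration, not part of Verdict.

```python
import time
from typing import Callable


class RateLimiter:
    """Token bucket (hypothetical sketch, single-threaded).

    Allows `rate` calls per second on average, with bursts up to `capacity`.
    """

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self) -> float:
        """Wait until one token is available; return the seconds waited."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

        waited = 0.0
        if self.tokens < 1.0:
            # Sleep just long enough for the missing fraction to refill.
            waited = (1.0 - self.tokens) / self.rate
            self.sleep(waited)
            self.tokens = 1.0
            self.last = self.clock()

        self.tokens -= 1.0
        return waited


if __name__ == "__main__":
    limiter = RateLimiter(rate=2.0, capacity=2)
    for i in range(4):
        waited = limiter.acquire()  # call the LLM provider after this returns
        print(f"request {i}: waited {waited:.2f}s")
```

Passing the clock and sleep functions in makes the limiter testable without real delays. Concurrent use would additionally need a lock around `acquire`.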

Maintenance & Community

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

The framework is designed for LLM-as-a-judge tasks and may require custom primitives for evaluation scenarios outside this scope. While it aims to mitigate LLM judge unreliability, the effectiveness of composed judges still depends on the underlying LLM capabilities and prompt engineering.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 65 stars in the last 90 days
