verdict by haizelabs

Framework for building and scaling LLM-as-a-judge evaluation systems

Created 1 year ago

322 stars

Top 84.5% on SourcePulse


5 Experts Love This Project

Didier Lopes

Founder of OpenBB

Thomas Wolf

Cofounder of Hugging Face

Nathan Lambert

Research Scientist at AI2

Will Brown

Research Lead at Prime Intellect

and 1 more!

Project Summary

Verdict is a declarative Python framework for building and scaling LLM-as-a-judge systems. It addresses the unreliability of current LLM judges by letting users compose complex, multi-step evaluation protocols, with a plug-and-play approach for rapid iteration across models, prompts, and aggregation strategies. The goal is more robust and efficient evaluation of AI applications, guardrails, and reinforcement learning tasks.

How It Works

Verdict builds judge protocols from three composable primitives: Unit, Layer, and Block. These primitives can be chained and repeated to create hierarchical reasoning, debate, and aggregation patterns. Composing multiple LLM calls in this way scales judge-time compute. The design draws on research in scalable oversight and automated evaluation, and the project claims state-of-the-art performance at lower latency and cost than monolithic reasoning models.
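
To make the composition idea concrete, here is a minimal, self-contained sketch in plain Python. It does not use Verdict's actual API: the `Unit`, `Layer`, and `Block` classes below are hypothetical stand-ins that only borrow the names, and the judges are deterministic stubs in place of real LLM calls. It only illustrates how several judge calls on the same input can be pooled into one verdict.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical stand-ins for illustration; NOT Verdict's real classes or signatures.
Judge = Callable[[str], float]  # maps a candidate response to a score in [0, 1]


@dataclass
class Unit:
    """A single judge call. A real system would wrap an LLM request here."""
    judge: Judge

    def run(self, response: str) -> float:
        return self.judge(response)


@dataclass
class Layer:
    """A group of units that all evaluate the same input."""
    units: Sequence[Unit]

    def run(self, response: str) -> List[float]:
        return [unit.run(response) for unit in self.units]


@dataclass
class Block:
    """A layer followed by an aggregation step that pools the scores."""
    layer: Layer
    aggregate: Callable[[List[float]], float]

    def run(self, response: str) -> float:
        return self.aggregate(self.layer.run(response))


# Stub judges standing in for LLM calls (assumed criteria, chosen for the demo).
def length_judge(response: str) -> float:
    return 1.0 if len(response.split()) >= 5 else 0.0


def politeness_judge(response: str) -> float:
    return 1.0 if "please" in response.lower() else 0.0


def citation_judge(response: str) -> float:
    return 1.0 if "[" in response and "]" in response else 0.0


def majority_vote(scores: List[float]) -> float:
    """Return 1.0 when more than half of the judges scored at least 0.5."""
    passed = sum(score >= 0.5 for score in scores)
    return 1.0 if passed * 2 > len(scores) else 0.0


judges = Layer([Unit(length_judge), Unit(politeness_judge), Unit(citation_judge)])
mean_ensemble = Block(layer=judges, aggregate=mean)
vote_ensemble = Block(layer=judges, aggregate=majority_vote)
```

Swapping `mean` for `majority_vote` changes the aggregation strategy without touching the judges, which is the kind of iteration the composable design is meant to make cheap.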

Quick Start & Requirements

Install via pip: pip install verdict
Requires Python 3.7+
Supports various LLM providers (e.g., OpenAI, Anthropic, Cohere) via configuration.
Official Docs: https://verdict.haizelabs.com/docs
Whitepaper: https://verdict.haizelabs.com/whitepaper.pdf

Highlighted Details

Achieves state-of-the-art or near state-of-the-art performance on benchmarks for content moderation, hallucination detection, and fact-checking.
Enables hierarchical reasoning and debate-aggregation patterns for complex evaluations.
Integrates with DSPy for use as a metric in AI system optimization.
Client-side rate limiting guards against losing experiment results to provider rate-limit errors.
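
The listing does not say how the client-side rate limiting is implemented. As a rough illustration of the general idea, the sketch below uses a token bucket, a common technique chosen here as an assumption, not as a description of Verdict's internals. The `TokenBucket` class and its parameters are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical client-side limiter: about `rate` requests per second,
    with bursts of up to `capacity` requests."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self) -> float:
        """Block until one token is available, then consume it.

        Returns the number of seconds spent waiting.
        """
        waited = 0.0
        self._refill()
        while self.tokens < 1.0:
            delay = (1.0 - self.tokens) / self.rate
            self.sleep(delay)
            waited += delay
            self._refill()
        self.tokens -= 1.0
        return waited
```

A caller would invoke `acquire()` immediately before each provider request, so that a long batch of judge calls slows down instead of failing on a rate-limit error midway through.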

Maintenance & Community

Active development with a growing community.
Discord server available for support and discussion: https://discord.gg/CzfKnCMvwx

Licensing & Compatibility

Licensed under the Apache License 2.0.
Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

The framework is designed for LLM-as-a-judge tasks and may require custom primitives for evaluation scenarios outside this scope. While it aims to mitigate LLM judge unreliability, the effectiveness of composed judges still depends on the underlying LLM capabilities and prompt engineering.

Health Check

Last Commit: 2 months ago

Responsiveness: 1 day

Star History

9 stars in the last 30 days