judges by quotient-ai

LLM evaluation library for classifiers and graders

Created 1 year ago

301 stars

Top 88.5% on SourcePulse

2 Experts Love This Project

Will Brown

Research Lead at Prime Intellect

Rotem Weiss

Cofounder of Tavily

Project Summary

This library provides a curated set of LLM-as-a-Judge evaluators for assessing AI model outputs. It targets developers and researchers who build and evaluate LLM applications. It offers a low-friction, research-backed format for using and creating LLM evaluators, with the aim of improving output quality and reliability.

How It Works

The library offers two primary judge types: Classifiers, which return a boolean True/False (pass/fail), and Graders, which return a numerical or Likert-scale score. A judge is invoked through its .judge() method, which accepts an input, an output, and an optional expected value. For classifiers, the library automatically resolves the underlying LLM's response into a boolean. A Jury object combines multiple judges, diversifying and averaging their judgments to produce a single Verdict.
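The pattern described above can be sketched in plain Python. This is a self-contained illustration, not the library's own code: every class here (Judgment, Verdict, ExactMatchClassifier, ContainsExpectedClassifier, LengthGrader, Jury) is a hypothetical stand-in, the judges use simple string rules instead of an LLM call, and the method name vote is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Union


@dataclass
class Judgment:
    """Result of one judge: a score plus a short explanation."""
    score: Union[bool, int, float]
    reasoning: str


@dataclass
class Verdict:
    """Combined result of several judges."""
    score: float
    judgments: List[Judgment]


class ExactMatchClassifier:
    """Classifier-style judge (hypothetical): returns True/False."""

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        passed = expected is not None and output.strip().lower() == expected.strip().lower()
        return Judgment(score=passed, reasoning="output equals expected" if passed else "no exact match")


class ContainsExpectedClassifier:
    """Classifier-style judge (hypothetical): True if expected appears in output."""

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        passed = expected is not None and expected.strip().lower() in output.lower()
        return Judgment(score=passed, reasoning="expected text found" if passed else "expected text missing")


class LengthGrader:
    """Grader-style judge (hypothetical): 1-5 Likert score, shorter is better."""

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        words = len(output.split())
        score = 5 if words <= 5 else 3 if words <= 20 else 1
        return Judgment(score=score, reasoning=f"{words} words")


class Jury:
    """Runs every judge on the same case and averages the scores."""

    def __init__(self, judges):
        self.judges = list(judges)

    def vote(self, input: str, output: str, expected: Optional[str] = None) -> Verdict:
        judgments = [j.judge(input=input, output=output, expected=expected) for j in self.judges]
        # Booleans are averaged as 0.0 / 1.0.
        return Verdict(score=mean(float(j.score) for j in judgments), judgments=judgments)


jury = Jury([ExactMatchClassifier(), ContainsExpectedClassifier()])
verdict = jury.vote(
    input="What is the capital of France?",
    output="The capital is Paris.",
    expected="Paris",
)
print(verdict.score)  # one judge fails, one passes
for judgment in verdict.judgments:
    print(judgment.score, judgment.reasoning)
```

The jury here only mixes classifiers, since averaging a boolean with a 1-5 Likert score would not give a meaningful number.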

Quick Start & Requirements

Install via pip install judges.
Requires an API key for the chosen LLM provider (e.g., OPENAI_API_KEY).
Example usage involves importing judge classes and calling the .judge() method with model outputs.
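Because a missing API key typically only surfaces when the first request is made, it can help to check for it up front. The helper below is a minimal sketch using only the standard library; the name require_api_key is hypothetical and is not part of the judges package.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the given environment variable, or fail early."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before creating a judge.")
    return key
```

Calling require_api_key() at program start turns a late provider error into an immediate, readable one.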

Highlighted Details

Supports creating custom judges by inheriting from BaseJudge and implementing a .judge() method.
Includes an AutoJudge feature to create task-specific judges from labeled datasets and descriptions.
Provides a CLI for evaluating single or batch test cases with various judges.
Offers a comprehensive appendix listing classifiers and graders with their descriptions and reference papers.
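The custom-judge mechanism mentioned above (inherit from a base class, implement .judge()) might look roughly like the following. This is a sketch under stated assumptions: BaseJudge and Judgment are local stand-ins written for this example and may not match the library's real classes, and PolitenessJudge, resolve_bool, and fake_llm are hypothetical names. A deterministic fake function replaces the LLM call so the example runs offline.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Judgment:
    score: bool
    reasoning: str


class BaseJudge(ABC):
    """Stand-in base class (assumed shape): subclasses must implement judge()."""

    @abstractmethod
    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        ...


def resolve_bool(text: str) -> bool:
    """Map the first word of a free-text model reply to a boolean."""
    stripped = text.strip()
    token = stripped.split()[0].strip(".,:;!").lower() if stripped else ""
    if token in {"true", "yes", "pass"}:
        return True
    if token in {"false", "no", "fail"}:
        return False
    raise ValueError(f"cannot resolve a boolean from: {text!r}")


class PolitenessJudge(BaseJudge):
    """Hypothetical custom classifier: is the response polite?"""

    def __init__(self, complete: Callable[[str], str]):
        # `complete` is any function that maps a prompt to a text reply.
        self.complete = complete

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        prompt = (
            "Is the following response polite? Answer True or False first.\n"
            f"Question: {input}\n"
            f"Response: {output}"
        )
        raw = self.complete(prompt)
        return Judgment(score=resolve_bool(raw), reasoning=raw)


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model call, used only for this example."""
    response = prompt.rsplit("Response:", 1)[-1]
    if "please" in response.lower():
        return "True. The reply uses courteous wording."
    return "False. No courteous wording found."


judge = PolitenessJudge(complete=fake_llm)
result = judge.judge(input="Can you help?", output="Yes, please send the file.")
print(result.score, result.reasoning)
```

Passing the completion function in through the constructor keeps the judge testable without network access; a real deployment would pass a function that calls the chosen LLM provider.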

Maintenance & Community

No specific community channels or notable contributors are mentioned in the README.

Licensing & Compatibility

The library's license is not explicitly stated in the README.

Limitations & Caveats

The README does not specify any limitations or known caveats regarding the library's functionality or stability.

Health Check

Last Commit

4 months ago

Responsiveness

1 day

Pull Requests (30d)