judges by quotient-ai

LLM evaluation library for classifiers and graders

created 11 months ago
255 stars

Top 98.8% on SourcePulse

Project Summary

This library provides a curated set of LLM-as-a-Judge evaluators for assessing AI model outputs, targeting developers and researchers building and evaluating LLM applications. It offers a low-friction format for using and creating LLM evaluators, backed by research, to improve output quality and reliability.

How It Works

The library offers two primary judge types: Classifiers, which return a boolean (True/False) pass/fail result, and Graders, which return a numerical or Likert-scale score. Judges are invoked via a .judge() method that accepts an input, an output, and an optional expected value. The library parses the underlying LLM's response into a boolean result automatically. A Jury object combines multiple judges to produce a diversified, averaged judgment called a Verdict.
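The distinction between classifiers, graders, and a jury can be illustrated with a small self-contained sketch. It does not import or reproduce the library's API: the names Judgment, Verdict, jury_vote, and the three toy judges are hypothetical stand-ins, and the judges use simple string rules instead of LLM calls so the example runs without an API key.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional, Union


@dataclass
class Judgment:
    """Result of one judge: a score plus the reasoning behind it."""
    score: Union[bool, int, float]
    reasoning: str


@dataclass
class Verdict:
    """Combined result of several judges (hypothetical shape)."""
    score: float
    judgments: List[Judgment]


# Hypothetical judge signature: (input, output, expected) -> Judgment
JudgeFn = Callable[[str, str, Optional[str]], Judgment]


def exact_match_classifier(input: str, output: str, expected: Optional[str] = None) -> Judgment:
    """Classifier: True when the output matches the expected answer."""
    passed = expected is not None and output.strip().lower() == expected.strip().lower()
    return Judgment(passed, "matches expected" if passed else "differs from expected")


def non_empty_classifier(input: str, output: str, expected: Optional[str] = None) -> Judgment:
    """Classifier: True when the output contains any non-whitespace text."""
    passed = bool(output.strip())
    return Judgment(passed, "output is non-empty" if passed else "output is empty")


def brevity_grader(input: str, output: str, expected: Optional[str] = None) -> Judgment:
    """Grader: toy 1-5 Likert-style score that rewards shorter answers."""
    words = len(output.split())
    score = 5 if words <= 10 else 3 if words <= 50 else 1
    return Judgment(score, f"{words} words")


def jury_vote(judges: List[JudgeFn], input: str, output: str,
              expected: Optional[str] = None) -> Verdict:
    """Run every judge and average the scores (True counts as 1.0, False as 0.0)."""
    judgments = [judge(input, output, expected) for judge in judges]
    return Verdict(mean([float(j.score) for j in judgments]), judgments)


question = "What is the capital of France?"
jury = [exact_match_classifier, non_empty_classifier]
right = jury_vote(jury, question, "Paris", "Paris")
wrong = jury_vote(jury, question, "Lyon", "Paris")
grade = brevity_grader(question, "Paris")
```

In this sketch the jury holds only classifiers, so the averaged score reads as the fraction of judges that passed. Mixing boolean and Likert scores in a single average would blend two different scales.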

Quick Start & Requirements

  • Install via pip install judges.
  • Requires an API key for the chosen LLM provider (e.g., OPENAI_API_KEY).
  • Example usage involves importing judge classes and calling the .judge() method with model outputs.
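Because a missing provider key typically only surfaces when the first judge runs, a pre-flight check can fail earlier with a clearer message. The helper below is a minimal sketch using only the standard library; require_api_key is a hypothetical name and is not part of the judges package.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the given environment variable, or raise if it is unset or blank."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running any judge.")
    return key
```

Calling require_api_key() once at startup, before constructing any judge, keeps the failure close to its cause.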

Highlighted Details

  • Supports creating custom judges by inheriting from BaseJudge and implementing a .judge() method.
  • Includes an AutoJudge feature to create task-specific judges from labeled datasets and descriptions.
  • Provides a CLI for evaluating single or batch test cases with various judges.
  • Offers a comprehensive appendix listing classifiers and graders with their descriptions and reference papers.
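The custom-judge pattern from the first bullet (subclass a base class and implement .judge()) can be sketched as follows. The library's real BaseJudge is not imported here: BaseJudgeSketch, Judgment, resolve_boolean, PolitenessJudge, and the complete callable are all hypothetical stand-ins, and the LLM is replaced by a fake function so the example runs offline. The boolean parsing shown is one plausible approach, not necessarily how the library resolves its outputs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Judgment:
    score: bool
    reasoning: str


class BaseJudgeSketch(ABC):
    """Local stand-in for a judge base class; NOT the library's BaseJudge."""

    def __init__(self, complete: Callable[[str], str]):
        # `complete` is any function that maps a prompt to the model's reply text.
        self.complete = complete

    @abstractmethod
    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        ...


def resolve_boolean(reply: str) -> bool:
    """Map the first word of a free-text reply to True/False, or raise if unclear."""
    words = reply.strip().split()
    first = words[0].strip(".,:;!").lower() if words else ""
    if first in {"true", "yes", "pass"}:
        return True
    if first in {"false", "no", "fail"}:
        return False
    raise ValueError(f"cannot resolve a boolean from reply: {reply!r}")


class PolitenessJudge(BaseJudgeSketch):
    """Custom classifier: asks the model whether a reply is polite."""

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> Judgment:
        prompt = (
            "Answer True or False, then explain briefly.\n"
            f"User message: {input}\n"
            f"Assistant reply: {output}\n"
            "Is the assistant reply polite?"
        )
        reply = self.complete(prompt)
        return Judgment(score=resolve_boolean(reply), reasoning=reply)


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call."""
    return "True. The reply is courteous."


judgment = PolitenessJudge(fake_llm).judge(
    input="Where is my order?",
    output="Thanks for asking! It ships tomorrow.",
)
```

Passing the model call in as a plain callable keeps the judge testable without network access; a real provider client could be wrapped in the same signature.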

Maintenance & Community

No specific community channels or notable contributors are mentioned in the README.

Licensing & Compatibility

The library's license is not explicitly stated in the README.

Limitations & Caveats

The README does not specify any limitations or known caveats regarding the library's functionality or stability.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 25 stars in the last 30 days
