LLM evaluation library for classifiers and graders
This library provides a curated set of LLM-as-a-Judge evaluators for assessing AI model outputs, aimed at developers and researchers building and evaluating LLM applications. It offers a low-friction way to use and create research-backed LLM evaluators that improve output quality and reliability.
How It Works
The library offers two primary judge types: Classifiers, which return a boolean True/False pass/fail judgment, and Graders, which return numerical or Likert-scale scores. Judges are invoked via a .judge() method that accepts an input, an output, and an optional expected value. The library automatically resolves boolean outputs from the underlying LLM prompts. A Jury object combines multiple judges so their judgments can be diversified and averaged, producing a Verdict.
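A minimal sketch of that flow: two classifier judges are combined into a Jury, and the resulting Verdict aggregates their judgments. The classifier class and module path, the voting_method parameter, and the vote() method name are assumptions drawn from the upstream project and may differ from the installed version.

```python
# Sketch: combine two judges into a Jury and inspect the Verdict.
# PollMultihopCorrectness, voting_method, and vote() are assumed names;
# verify them against the installed version of the library.
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness  # assumed path

correctness_mini = PollMultihopCorrectness(model="gpt-4o-mini")
correctness_4o = PollMultihopCorrectness(model="gpt-4o")

jury = Jury(judges=[correctness_mini, correctness_4o], voting_method="average")
verdict = jury.vote(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    expected="Paris",
)

print(verdict.score)  # averaged judgment across the jury
```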
Quick Start & Requirements
- Install with pip install judges.
- Requires an OpenAI API key (set the OPENAI_API_KEY environment variable).
- Evaluate by calling a judge's .judge() method with model outputs; a minimal example follows below.
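An end-to-end sketch of the steps above, assuming an OpenAI-backed judge. The classifier class and module path, and the Judgment fields (score, reasoning), are assumptions based on the upstream project; substitute any classifier shipped with the library.

```python
# Quick-start sketch: run `pip install judges` and export OPENAI_API_KEY first.
# The classifier name/module path and Judgment fields below are assumptions.
import os

from judges.classifiers.correctness import PollMultihopCorrectness

assert os.environ.get("OPENAI_API_KEY"), "set OPENAI_API_KEY before running"

judge = PollMultihopCorrectness(model="gpt-4o-mini")

judgment = judge.judge(
    input="What is the capital of France?",    # prompt sent to your model
    output="The capital of France is Paris.",  # your model's answer under evaluation
    expected="Paris",                          # optional reference answer
)

print(judgment.score)      # classifier judges return True/False
print(judgment.reasoning)  # the judge model's explanation
```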
Highlighted Details
- Custom judges can be created by subclassing BaseJudge and implementing a .judge() method (see the sketch after this list).
- An AutoJudge feature creates task-specific judges from labeled datasets and task descriptions.
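A sketch of the subclassing pattern described above. The Judgment return type and the _judge helper (assumed to send the prompts to the underlying LLM and parse its response) are assumptions drawn from the upstream project; check judges.base for the actual signatures.

```python
# Sketch of a custom judge: subclass BaseJudge and implement .judge().
# Judgment and the _judge(user_prompt, system_prompt) helper are assumed to
# live in judges.base; verify against the installed version.
from judges.base import BaseJudge, Judgment


class ConcisenessJudge(BaseJudge):
    """Classifier-style judge: passes only if the output is concise and on-topic."""

    def judge(self, input: str, output: str | None = None, expected: str | None = None) -> Judgment:
        system_prompt = "You are an evaluator that checks whether answers are concise."
        user_prompt = (
            f"Question: {input}\n"
            f"Answer: {output}\n"
            "Reply True if the answer is concise and on-topic, otherwise False, "
            "and explain your reasoning."
        )
        # _judge is assumed to call the underlying LLM and return (reasoning, score)
        reasoning, score = self._judge(user_prompt=user_prompt, system_prompt=system_prompt)
        return Judgment(reasoning=reasoning, score=score)


# Usage: ConcisenessJudge(model="gpt-4o-mini").judge(input=question, output=answer)
```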
Maintenance & Community
No specific community channels or notable contributors are mentioned in the README.
Licensing & Compatibility
The library's license is not explicitly stated in the README.
Limitations & Caveats
The README does not specify any limitations or known caveats regarding the library's functionality or stability.