prometheus by prometheus-eval

Evaluator LM for fine-grained assessment using customized rubrics

Created 2 years ago

311 stars

Top 86.8% on SourcePulse

Project Summary

Prometheus provides an open-source, reproducible, and cost-effective solution for fine-grained evaluation of language models using a customized score rubric. It serves as an alternative to human or GPT-4 evaluation, targeting researchers and developers needing detailed LLM performance assessments.

How It Works

Prometheus is an evaluator LM fine-tuned to write feedback and assign a score according to a user-supplied rubric. Its prompt has four parts: the instruction, the response to evaluate, a reference answer that would earn the top score, and the detailed scoring criteria. The model generates feedback followed by an integer score from 1 to 5, formatted as "Feedback: (feedback) [RESULT] (score)". Because the criteria are explicit, each evaluation targets the rubric rather than general quality.
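The prompt assembly and output parsing described above can be sketched in Python. This is a minimal illustration, not the project's own code: the template wording, the section labels, and the names build_prompt, parse_evaluation, and Evaluation are assumptions made for this example. Only the four prompt components and the "Feedback: (feedback) [RESULT] (score)" output shape come from the description above.

```python
import re
from typing import NamedTuple, Optional


class Evaluation(NamedTuple):
    feedback: str
    score: Optional[int]  # None when no valid [RESULT] marker is found


# Hypothetical template: the section labels are illustrative only and are
# not claimed to match the official Prometheus prompt text.
PROMPT_TEMPLATE = (
    "###Instruction to evaluate:\n{instruction}\n\n"
    "###Response to evaluate:\n{response}\n\n"
    "###Reference answer (score 5):\n{reference}\n\n"
    "###Score rubric:\n{rubric}\n\n"
    "###Feedback: "
)

# Accept only a single digit from 1 to 5 after the [RESULT] marker.
_RESULT_RE = re.compile(r"\[RESULT\]\s*([1-5])\b")


def build_prompt(instruction: str, response: str, reference: str, rubric: str) -> str:
    """Combine the four components into one prompt string."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference=reference,
        rubric=rubric,
    )


def parse_evaluation(text: str) -> Evaluation:
    """Split model output into feedback text and an integer score."""
    match = _RESULT_RE.search(text)
    if match is None:
        # Malformed output: keep the raw text and report no score.
        return Evaluation(text.strip(), None)
    feedback = text[: match.start()].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return Evaluation(feedback, int(match.group(1)))
```

Returning None for the score, rather than raising, lets a batch evaluation continue past malformed generations and count them afterwards.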

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Inference requires the URL of a running Hugging Face Text Generation Inference (TGI) server.
Training is launched with torchrun and builds on llama-recipes.
See the inference directory for example inference scripts.
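As a rough illustration of what querying a TGI server involves, the sketch below posts a prompt to TGI's /generate endpoint using only the Python standard library. It is not taken from the repository's inference scripts, which may be structured differently. The helper names, the default token limit, and the example server URL are assumptions for this example.

```python
import json
import urllib.request


def build_tgi_payload(prompt: str, max_new_tokens: int = 512) -> dict:
    """Build the JSON body for a TGI /generate request."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def generate_url(server_url: str) -> str:
    """Append the /generate path to the server's base URL."""
    return server_url.rstrip("/") + "/generate"


def generate(server_url: str, prompt: str, max_new_tokens: int = 512,
             timeout: float = 60.0) -> str:
    """Send the prompt to the TGI server and return the generated text."""
    body = json.dumps(build_tgi_payload(prompt, max_new_tokens)).encode("utf-8")
    request = urllib.request.Request(
        generate_url(server_url),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["generated_text"]
```

With a server running, a call might look like generate("http://localhost:8080", prompt), where the address is a placeholder for wherever the TGI instance is hosted.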

Highlighted Details

Fine-grained evaluation on custom score rubrics.
Reproducible evaluation framework.
Alternative to human and GPT-4 evaluation.
Trained on the Feedback Collection dataset.

Maintenance & Community

Project associated with ICLR 2024 and NeurIPS 2023 workshops.
Citation available for academic use.

Licensing & Compatibility

License details are not explicitly stated in the README.

Limitations & Caveats

The README does not specify the base model used for Prometheus or provide explicit licensing information, which may impact commercial use or integration into closed-source projects. Inference setup requires a separate TGI server.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days