prometheus-eval by prometheus-eval

LLM evaluation framework using open LLMs

created 1 year ago
977 stars

Top 38.6% on sourcepulse

Project Summary

Prometheus-Eval provides a suite of tools for evaluating Large Language Models (LLMs) on generation tasks and offers an open-source alternative to proprietary evaluation methods. Users assess LLM responses with its Prometheus models, which are trained to act as impartial judges and support both absolute grading (1-5 scores) and relative grading (A/B comparisons).
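Both grading modes come down to pulling a verdict out of the judge's text output. The sketch below assumes that the judge ends its feedback with a `[RESULT]` marker followed by either a score from 1 to 5 (absolute grading) or `A`/`B` (relative grading). That output convention and the helper names `parse_absolute` and `parse_relative` are assumptions made for illustration. They are not the library's API.

```python
import re
from typing import Optional

# Assumed output convention: "<feedback text> [RESULT] <verdict>"
_RESULT_RE = re.compile(r"\[RESULT\]\s*(\S+)")


def _last_verdict(output: str) -> Optional[str]:
    """Return the token after the last [RESULT] marker, without a trailing period."""
    matches = _RESULT_RE.findall(output)
    return matches[-1].rstrip(".") if matches else None


def parse_absolute(output: str) -> Optional[int]:
    """Hypothetical helper: extract a 1-5 score, or None if missing or out of range."""
    token = _last_verdict(output)
    if token is not None and re.fullmatch(r"[1-5]", token):
        return int(token)
    return None


def parse_relative(output: str) -> Optional[str]:
    """Hypothetical helper: extract 'A' or 'B', or None if missing or invalid."""
    token = _last_verdict(output)
    if token is None:
        return None
    token = token.upper()
    return token if token in {"A", "B"} else None
```

The parser uses the last `[RESULT]` marker, so a judge that quotes the marker inside its feedback still gets its final verdict read. Malformed outputs return `None` rather than a guessed score, which lets a caller retry or discard them.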

How It Works

Prometheus-Eval builds on its family of Prometheus LLMs, which are designed specifically for evaluation and fine-tuned to apply complex scoring rubrics to LLM-generated text. Inference can run locally via vLLM or through LLM APIs via LiteLLM, so the judge can be a powerful proprietary model such as GPT-4 or a self-hosted Prometheus instance.
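The sketch below makes the rubric idea concrete. It assembles an absolute-grading prompt from four parts: an instruction, a response, an optional reference answer, and a rubric with one description per score. The template wording, the `Rubric` dataclass, and `build_absolute_prompt` are all hypothetical. Prometheus-Eval ships its own prompt templates, and this code does not reproduce them. The resulting string would then go to whichever backend (vLLM or LiteLLM) hosts the judge model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Rubric:
    """Hypothetical rubric: one criterion plus a description for each score 1-5."""
    criteria: str
    descriptions: Dict[int, str]

    def __post_init__(self) -> None:
        if sorted(self.descriptions) != [1, 2, 3, 4, 5]:
            raise ValueError("rubric needs exactly one description per score 1-5")


def build_absolute_prompt(instruction: str, response: str, rubric: Rubric,
                          reference_answer: Optional[str] = None) -> str:
    """Assemble an illustrative absolute-grading prompt (not the library's template)."""
    parts = [
        "You are a fair judge. Write feedback, then end with '[RESULT] <score 1-5>'.",
        f"### Instruction:\n{instruction}",
        f"### Response to evaluate:\n{response}",
    ]
    if reference_answer is not None:
        parts.append(f"### Reference answer (score 5):\n{reference_answer}")
    parts.append(f"### Score rubric: {rubric.criteria}")
    for score in range(1, 6):
        # List every score level so the judge sees the full scale.
        parts.append(f"Score {score}: {rubric.descriptions[score]}")
    return "\n".join(parts)
```

The rubric is validated when it is constructed, so a missing score level fails early instead of producing a prompt with a gap in the scale.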

Quick Start & Requirements

Highlighted Details

  • Prometheus 2 (8x7B) achieves a Pearson correlation of 0.6-0.7 with GPT-4-1106 on Likert scale benchmarks and 72-85% agreement with human judgments on pairwise ranking benchmarks.
  • Supports both absolute grading (1-5 scores) and pairwise ranking (A/B comparison) through different prompt formats.
  • Offers batch processing for significantly faster evaluation of multiple response pairs.
  • Includes specialized datasets like BiGGen-Bench for comprehensive LLM evaluation.
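The two headline metrics in the first bullet can be computed as shown below. The first is Pearson correlation between judge scores and reference scores. The second is the agreement rate between the judge's pairwise picks and human picks. The sketch uses only the standard library. Any inputs you pass are your own; nothing here comes from the Prometheus papers.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists (e.g. judge vs. GPT-4)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either list is constant (zero variance).
    return cov / math.sqrt(var_x * var_y)


def agreement(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where the judge picks the same winner as humans."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)
```

For example, `agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])` gives 0.75. On Python 3.10 and later, `statistics.correlation` computes the same Pearson coefficient.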

Maintenance & Community

The project is actively developed, with recent releases of M-Prometheus models and the BiGGen-Bench. Key contributors and affiliations are listed in the paper citations. Community interaction channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is released under the Apache 2.0 license, which permits commercial use and integration into closed-source applications.

Limitations & Caveats

The README notes that the library is in beta and encourages users to report any issues they encounter. Prometheus 2 (7B) runs on consumer GPUs with 16GB of VRAM, while the larger 8x7B model has higher hardware requirements.
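One rough way to see why the two model sizes differ in hardware needs is to estimate the memory taken by the weights alone, which is parameter count times bytes per parameter. The parameter counts below are assumptions based on Mistral-7B-style and Mixtral-8x7B-style base models. They are about 7.2B for the 7B model and about 46.7B in total for the 8x7B mixture-of-experts model, whose experts all stay in memory even though only some are active per token. The estimate ignores the KV cache, activations, and framework overhead, so real usage is higher.

```python
# Bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, precision: str = "fp16") -> float:
    """Approximate GiB needed just to hold the weights (no KV cache or activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# Assumed parameter counts (Mistral-7B / Mixtral-8x7B-style base models).
MODELS = {"Prometheus 2 (7B)": 7.2e9, "Prometheus 2 (8x7B)": 46.7e9}

if __name__ == "__main__":
    for name, params in MODELS.items():
        for precision in ("fp16", "int4"):
            print(f"{name} @ {precision}: ~{weight_memory_gib(params, precision):.1f} GiB")
```

Under these assumptions, the fp16 weights come to roughly 13.4 GiB for the 7B model, which fits within a 16GB card. For the 8x7B model they come to roughly 87 GiB, which is beyond any single consumer GPU without quantization or multi-GPU sharding.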

Health Check

Last commit: 3 months ago
Responsiveness: 1 week
Pull Requests (30d): 0
Issues (30d): 2
Star History: 53 stars in the last 90 days
