evals by openai

Framework for evaluating LLMs and LLM systems, plus benchmark registry

created 2 years ago
16,657 stars

Top 2.9% on sourcepulse

Project Summary

Evals is a framework and registry for evaluating Large Language Models (LLMs) and LLM-powered systems. It lets users run existing benchmarks, create custom evaluations from their own data, and test many dimensions of LLM performance. It is aimed at developers, researchers, and prompt engineers building with LLMs who need a structured way to measure and improve model behavior.

How It Works

Evals uses a YAML-based configuration system to define evaluation tasks, metrics, and datasets. Custom logic can be written in Python, and a Completion Function Protocol lets evaluations target not only raw models but also prompt chains and tool-using agents. The framework is designed to be extensible, so users can contribute new benchmarks or adapt existing ones to their specific use cases.
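
To make the registry and the Completion Function Protocol concrete, below is a minimal Python sketch of a custom eval class, modeled on the interfaces described in the repo's custom-eval documentation (evals.Eval, eval_all_samples, record_and_check_match). The class name and the sample fields ("problem", "answer") are illustrative assumptions, not part of the project.

    # Sketch of a custom eval, assuming the interfaces documented in openai/evals.
    # Class name and sample fields ("problem", "answer") are hypothetical.
    import evals
    import evals.metrics


    class ArithmeticSketch(evals.Eval):
        def eval_sample(self, sample, rng):
            # Ask the configured completion function (a model, prompt chain,
            # or tool-using agent) to answer the sample's problem.
            result = self.completion_fn(prompt=sample["problem"], max_tokens=25)
            sampled = result.get_completions()[0]
            # Record whether the sampled answer matches the expected answer.
            evals.record_and_check_match(
                prompt=sample["problem"],
                sampled=sampled,
                expected=sample["answer"],
            )

        def run(self, recorder):
            samples = self.get_samples()
            self.eval_all_samples(recorder, samples)
            # Aggregate per-sample "match" events into a single accuracy metric.
            return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}

A YAML entry in the registry would then map an eval name to this class and to a JSONL file of samples, which is how the framework wires datasets, metrics, and code together.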

Quick Start & Requirements

  • Install via pip: pip install evals
  • Requires Python 3.9+ and an OpenAI API key (set as the OPENAI_API_KEY environment variable).
  • Fetching benchmark data requires Git LFS: run git lfs fetch --all and git lfs pull in the cloned repo.
  • Official quick-start and documentation: https://github.com/openai/evals (a minimal run sketch follows this list).
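
As a small end-to-end illustration of the steps above, the sketch below checks for the API key and then shells out to the oaieval command-line runner that the package installs; the model and eval names ("gpt-3.5-turbo", "test-match") are examples from the repo's documentation and may need to be changed for your setup.

    # Minimal run sketch: assumes `pip install evals` has put `oaieval` on PATH
    # and that the chosen eval exists in the registry.
    import os
    import subprocess
    import sys

    if "OPENAI_API_KEY" not in os.environ:
        sys.exit("Set the OPENAI_API_KEY environment variable before running evals.")

    # `oaieval <completion_fn> <eval_name>` runs one eval and reports its metrics.
    subprocess.run(["oaieval", "gpt-3.5-turbo", "test-match"], check=True)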

Highlighted Details

  • Integrates with the OpenAI Dashboard for direct configuration and execution.
  • Supports logging results to Snowflake databases.
  • Offers a registry of pre-built evals for various LLM capabilities.
  • Allows private evals using custom data without public exposure.

Maintenance & Community

The project is maintained by OpenAI and accepts contributions via pull requests. The README does not mention community channels such as Discord or Slack.

Licensing & Compatibility

The repository is licensed under the MIT License. Contributions to evals are also made under the MIT License. OpenAI reserves the right to use contributed data for service improvements.

Limitations & Caveats

Currently, custom code submissions for evals are not accepted, though custom model-graded YAML files are permitted. A known issue exists where evals may hang at the end, requiring manual interruption.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 670 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Wei-Lin Chiang (Cofounder of LMArena).

evalplus by evalplus

Top 0.5% on sourcepulse
2k stars
LLM code evaluation framework for rigorous testing
created 2 years ago
updated 4 weeks ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Simon Willison (Author of Django), and 9 more.

simple-evals by openai

Top 0.4% on sourcepulse
4k stars
Lightweight library for evaluating language models
created 1 year ago
updated 3 weeks ago