Framework for evaluating LLMs and LLM systems, plus benchmark registry
Top 2.9% on sourcepulse
Evals is a framework and registry for evaluating Large Language Models (LLMs) and LLM-powered systems. It lets users run existing benchmarks, create custom evaluations from their own data, and test many dimensions of LLM performance. It is aimed at developers, researchers, and prompt engineers building with LLMs, giving them a structured way to measure and improve model behavior.
How It Works
Evals uses a flexible YAML-based configuration system to define evaluation tasks, metrics, and datasets. Custom logic can be written in Python, and the Completion Function Protocol supports more complex evaluation scenarios such as prompt chaining and tool-using agents. The framework is designed to be extensible, so users can contribute new benchmarks or adapt existing ones to their own use cases.
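As a rough sketch of what the Completion Function Protocol looks like in practice, the example below implements a trivial completion function. It assumes the CompletionFn / CompletionResult interfaces exposed by evals.api as described in the project's documentation; the class names, module path, and registry entry in the trailing comment are hypothetical, not taken from this summary.

```python
from typing import Any, Union

from evals.api import CompletionFn, CompletionResult


class EchoCompletionResult(CompletionResult):
    """Wraps a raw response so the framework can read completions uniformly."""

    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        # The framework consumes a list of completion strings.
        return [self.response]


class EchoCompletionFn(CompletionFn):
    """Toy completion function: echoes the last message of the prompt.

    A real implementation would call an LLM, a prompt chain, or a
    tool-using agent here and wrap its output in a CompletionResult.
    """

    def __call__(
        self, prompt: Union[str, list[dict]], **kwargs: Any
    ) -> EchoCompletionResult:
        if isinstance(prompt, str):
            text = prompt
        else:
            # Chat-style prompts arrive as a list of message dicts.
            text = prompt[-1].get("content", "")
        return EchoCompletionResult(text)


# A completion-fn registry YAML entry would point at this class, e.g.:
#   echo:
#     class: my_package.echo:EchoCompletionFn
# (hypothetical names; see the evals docs for the exact registry layout)
```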
Quick Start & Requirements
Install the package with pip install evals. Running evals against OpenAI models requires an API key (set the OPENAI_API_KEY environment variable). Evaluation datasets are stored with Git LFS, so after cloning run git lfs fetch --all and git lfs pull to download them. Individual evals are then run from the command line with the oaieval tool, for example oaieval gpt-3.5-turbo test-match.
Highlighted Details
Maintenance & Community
The project is maintained by OpenAI, and contributions are accepted via pull requests. The README does not mention a dedicated community channel such as Discord or Slack.
Licensing & Compatibility
The repository is licensed under the MIT License. Contributions to evals are also made under the MIT License. OpenAI reserves the right to use contributed data for service improvements.
Limitations & Caveats
Custom code submissions for new evals are not currently accepted, though custom model-graded YAML files are. There is also a known issue where an eval run may hang at the end and require manual interruption.
Last activity: about 7 months ago; the repository is currently marked as inactive.