Framework for evaluating LLMs and LLM systems, plus benchmark registry
Top 2.9% on sourcepulse
Evals is a framework and registry for evaluating Large Language Models (LLMs) and LLM-powered systems. It lets users run existing benchmarks, create custom evaluations from their own data, and test many dimensions of LLM performance. It is aimed at developers, researchers, and prompt engineers building with LLMs, giving them a structured way to measure and improve model behavior.
How It Works
Evals uses a flexible YAML-based configuration system to define evaluation tasks, metrics, and datasets. Custom logic can be written in Python, and the Completion Function Protocol supports more complex evaluation scenarios such as prompt chaining and tool-using agents. The framework is designed to be extensible, so users can contribute new benchmarks or adapt existing ones to their own use cases.
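As a rough sketch of what the Completion Function Protocol looks like in practice, the example below implements a trivial completion function. It assumes the CompletionFn / CompletionResult interfaces exposed by evals.api as described in the project's documentation; the class names, module path, and registry entry in the trailing comment are hypothetical, not taken from this summary.

```python
from typing import Any, Union

from evals.api import CompletionFn, CompletionResult


class EchoCompletionResult(CompletionResult):
    """Wraps a raw response so the framework can read completions uniformly."""

    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        # The framework consumes a list of completion strings.
        return [self.response]


class EchoCompletionFn(CompletionFn):
    """Toy completion function: echoes the last message of the prompt.

    A real implementation would call an LLM, a prompt chain, or a
    tool-using agent here and wrap its output in a CompletionResult.
    """

    def __call__(
        self, prompt: Union[str, list[dict]], **kwargs: Any
    ) -> EchoCompletionResult:
        if isinstance(prompt, str):
            text = prompt
        else:
            # Chat-style prompts arrive as a list of message dicts.
            text = prompt[-1].get("content", "")
        return EchoCompletionResult(text)


# A completion-fn registry YAML entry would point at this class, e.g.:
#   echo:
#     class: my_package.echo:EchoCompletionFn
# (hypothetical names; see the evals docs for the exact registry layout)
```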
Quick Start & Requirements
Install the package with pip install evals. Running evals against OpenAI models requires an API key (set the OPENAI_API_KEY environment variable). Evaluation datasets are stored with Git LFS, so after cloning run git lfs fetch --all and git lfs pull to download them. Individual evals are then run from the command line with the oaieval tool, for example oaieval gpt-3.5-turbo test-match.
Highlighted Details
Maintenance & Community
The project is maintained by OpenAI, and contributions are accepted via pull requests. The README does not mention a dedicated community channel such as Discord or Slack.
Licensing & Compatibility
The repository is licensed under the MIT License. Contributions to evals are also made under the MIT License. OpenAI reserves the right to use contributed data for service improvements.
Limitations & Caveats
Custom code submissions for new evals are not currently accepted, though custom model-graded YAML files are. There is also a known issue where an eval run may hang at the end and require manual interruption.
Last activity: about 7 months ago; the repository is currently marked as inactive.