simple-evals by openai

Lightweight library for evaluating language models

Created 1 year ago

4,281 stars

Top 11.4% on SourcePulse

19 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Patrick von Platen

Author of Hugging Face Diffusers; Research Engineer at Mistral

Simon Willison

Coauthor of Django

Yineng Zhang

Inference Lead at SGLang; Research Scientist at Together AI

and 15 more!

Project Summary

This library is a lightweight framework for evaluating language models with zero-shot, chain-of-thought prompting, which is intended to reflect how the models are used in practice. It targets researchers and developers who need transparent, reproducible accuracy benchmarks for LLMs, and it ships a curated set of standard evaluations.

How It Works

The library emphasizes simple, direct instructions for evaluations, avoiding complex few-shot or role-playing prompts that can skew results for instruction-tuned models. This approach aims to better reflect real-world usage and model capabilities in a zero-shot setting.
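
To make the zero-shot, chain-of-thought setup concrete, here is a minimal sketch of how a multiple-choice question might be formatted and graded. The template wording, the `format_prompt`, `extract_answer`, and `score` helpers, and the `Answer: X` convention are all illustrative assumptions, not the repository's actual code.

```python
import re

# Hypothetical zero-shot, chain-of-thought template: one direct instruction,
# no few-shot examples and no role-play framing.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. Think step by step, then "
    "finish with a line of the form 'Answer: X' where X is one of ABCD.\n\n"
    "{question}\n\n"
    "A) {A}\nB) {B}\nC) {C}\nD) {D}"
)

# Matches a final-answer marker such as "Answer: B" (assumed output format).
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*([ABCD])\b")


def format_prompt(question, choices):
    """Fill the template with a question and a dict of choices keyed A-D."""
    return PROMPT_TEMPLATE.format(question=question, **choices)


def extract_answer(response):
    """Return the last answer letter found in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def score(response, correct):
    """Return 1.0 if the extracted letter equals the correct one, else 0.0."""
    return 1.0 if extract_answer(response) == correct else 0.0
```

Taking the last match rather than the first lets the model mention candidate answers while reasoning without affecting the grade; only the final stated answer counts.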

Quick Start & Requirements

Install: pip install openai for the OpenAI API; pip install anthropic for the Anthropic API.
Prerequisites: API keys for the providers you evaluate (OpenAI, Anthropic).
Running evals: python -m simple_evals.simple_evals --model <model_name> --examples <num_examples>
Additional setup for HumanEval: git clone https://github.com/openai/human-eval, then pip install -e human-eval.
More details: OpenAI API Docs, Anthropic API.
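
The steps above can be sketched as a small pre-flight helper that checks for API keys and assembles the run command. This is an illustration under stated assumptions: `missing_keys` and `build_command` are hypothetical helpers, and `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the environment variables that the official `openai` and `anthropic` SDKs read by default.

```python
import os
import sys

# Default environment variables read by the openai and anthropic SDKs.
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def missing_keys(providers, env=None):
    """Return the names of API-key variables that are unset or empty."""
    env = os.environ if env is None else env
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]


def build_command(model_name, num_examples):
    """Build the argument list for the documented eval invocation."""
    return [
        sys.executable, "-m", "simple_evals.simple_evals",
        "--model", model_name,
        "--examples", str(num_examples),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `missing_keys` returns an empty list. The model name must be one the library recognizes; none is assumed here.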

Highlighted Details

Includes benchmarks for MMLU, GPQA, MATH, HumanEval, DROP, MGSM, and SimpleQA.
Supports evaluation of OpenAI models (o1, o3, o4-mini, o3-mini, GPT-4 variants) and other models like Claude and Llama.
Focuses on zero-shot, chain-of-thought prompting for evaluation.
Not intended as a replacement for the more comprehensive openai/evals repository.

Maintenance & Community

The repository states that it will not be actively maintained. Accepted PRs are limited to bug fixes, adapters for new models, and new eval results.

Licensing & Compatibility

Contributed evaluation logic and data are released under the MIT License.
OpenAI reserves the right to use contributed data for service improvements.
Contributions are subject to OpenAI's Usage Policies.

Limitations & Caveats

This repository is not actively maintained and does not accept new evals. Some benchmarks (MGSM, DROP) may be saturated for newer models. For newer models, MATH results are reported on MATH-500, a newer IID version of the benchmark.

Health Check

Last Commit

5 months ago

Responsiveness

1 day

Star History

69 stars in the last 30 days