simple-evals by openai

Lightweight library for evaluating language models

created 1 year ago
3,886 stars

Top 12.8% on sourcepulse

Project Summary

This library provides a lightweight framework for evaluating language models, focusing on zero-shot, chain-of-thought prompting for realistic performance assessment. It targets researchers and developers needing transparent, reproducible benchmarks for LLM accuracy, offering a curated set of standard evaluations.

How It Works

The library emphasizes simple, direct instructions for evaluations, avoiding complex few-shot or role-playing prompts that can skew results for instruction-tuned models. This approach aims to better reflect real-world usage and model capabilities in a zero-shot setting.
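
To make this concrete, the sketch below sends one zero-shot, chain-of-thought style question through the official openai Python client. The prompt wording and model name are illustrative assumptions, not the library's own templates.

    # Illustrative zero-shot chain-of-thought request using the official
    # openai client. The prompt wording and model name are assumptions;
    # simple-evals applies its own templates internally.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    question = (
        "Answer the following multiple-choice question. Think step by step, "
        "then finish with a line of the form 'Answer: <LETTER>'.\n\n"
        "Which planet in the solar system is the largest?\n"
        "(A) Earth  (B) Jupiter  (C) Mars  (D) Venus"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # substitute any chat model you have access to
        messages=[{"role": "user", "content": question}],  # no few-shot examples, no role-play persona
    )
    print(response.choices[0].message.content)

Note the absence of few-shot demonstrations and role-play framing: the model sees only the task plus an instruction to reason step by step, which is the style this library standardizes on.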

Quick Start & Requirements

  • Install: pip install openai (for OpenAI API), pip install anthropic (for Anthropic API).
  • Prerequisites: API keys for supported models (OpenAI, Claude).
  • Running evals: python -m simple_evals.simple_evals --model <model_name> --examples <num_examples> (see the consolidated sketch after this list).
  • Additional setup for HumanEval: git clone https://github.com/openai/human-eval and pip install -e human-eval.
  • More details: OpenAI API Docs, Anthropic API.
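
Putting the steps above together, a minimal end-to-end run might look like the sketch below. The environment-variable names are the ones conventionally read by the OpenAI and Anthropic SDKs, and the model name is a placeholder; treat both as assumptions rather than documented requirements.

    # Consolidated sketch of the quick-start steps, expressed in Python.
    import os
    import subprocess

    os.environ.setdefault("OPENAI_API_KEY", "sk-...")         # replace with your key
    os.environ.setdefault("ANTHROPIC_API_KEY", "sk-ant-...")  # only needed for Claude models

    # Optional: HumanEval requires openai/human-eval installed locally.
    subprocess.run(["git", "clone", "https://github.com/openai/human-eval"], check=False)
    subprocess.run(["pip", "install", "-e", "human-eval"], check=True)

    # Run a small number of examples against one model, using the command shown
    # above (assumes a simple_evals checkout is importable from the current directory).
    subprocess.run(
        ["python", "-m", "simple_evals.simple_evals",
         "--model", "gpt-4o-mini", "--examples", "10"],  # model name is a placeholder
        check=True,
    )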

Highlighted Details

  • Includes benchmarks for MMLU, GPQA, MATH, HumanEval, DROP, MGSM, and SimpleQA.
  • Supports evaluation of OpenAI models (o1, o3, o4-mini, o3-mini, GPT-4 variants) and other models like Claude and Llama.
  • Focuses on zero-shot, chain-of-thought prompting for evaluation.
  • Not intended as a replacement for the more comprehensive openai/evals repository.

Maintenance & Community

The repository states that it will not be actively maintained and will accept only a limited set of PRs, such as bug fixes, adapters for new models, and new eval results.

Licensing & Compatibility

  • Evals and data contributed are under the MIT License.
  • OpenAI reserves the right to use contributed data for service improvements.
  • Contributions are subject to OpenAI's Usage Policies.

Limitations & Caveats

This repository is not actively maintained and will not accept new evals. Some benchmarks (MGSM, DROP) may be saturated for newer models. For newer models, MATH results are reported on MATH-500, a newer IID version of the benchmark.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0

Star History

1,166 stars in the last 90 days
