simple-evals by openai

Lightweight library for evaluating language models

Created 1 year ago
4,057 stars

Top 12.1% on SourcePulse

Project Summary

This library provides a lightweight framework for evaluating language models, focusing on zero-shot, chain-of-thought prompting for realistic performance assessment. It targets researchers and developers needing transparent, reproducible benchmarks for LLM accuracy, offering a curated set of standard evaluations.

How It Works

The library emphasizes simple, direct instructions for evaluations, avoiding complex few-shot or role-playing prompts that can skew results for instruction-tuned models. This approach aims to better reflect real-world usage and model capabilities in a zero-shot setting.
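
To make the contrast concrete, here is a minimal, purely illustrative sketch of a zero-shot, chain-of-thought style query sent through the OpenAI Python SDK. The prompt wording and model name are assumptions for illustration, not the library's actual templates.

```python
# Illustrative zero-shot, chain-of-thought style query via the OpenAI Python SDK.
# The prompt wording and model name are assumptions, not simple-evals' own templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "Answer the following multiple-choice question. Think step by step, then "
    "give your final answer on the last line as 'Answer: <LETTER>'.\n\n"
    "Which gas makes up most of Earth's atmosphere?\n"
    "A) Oxygen\nB) Nitrogen\nC) Carbon dioxide\nD) Argon"
)

# No few-shot examples and no role-play system prompt: the model sees only the
# plain instruction, which is closer to how end users actually query it.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute any supported model
    messages=[{"role": "user", "content": question}],
)
print(response.choices[0].message.content)
```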

Quick Start & Requirements

  • Install: pip install openai (for OpenAI API), pip install anthropic (for Anthropic API).
  • Prerequisites: API keys for supported models (OpenAI, Claude).
  • Running evals: python -m simple_evals.simple_evals --model <model_name> --examples <num_examples> (see the sketch after this list)
  • Additional setup for HumanEval: git clone https://github.com/openai/human-eval and pip install -e human-eval.
  • More details: OpenAI API Docs, Anthropic API.
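
The steps above can also be scripted. Below is a minimal sketch, assuming the repository has been cloned locally, the SDKs are installed, and the OpenAI and Anthropic SDKs read their keys from the standard OPENAI_API_KEY and ANTHROPIC_API_KEY environment variables; the model name and example count are placeholders.

```python
# Minimal sketch: drive the documented CLI from Python. Assumes the repository is
# cloned locally and the openai/anthropic SDKs are installed; both SDKs read
# OPENAI_API_KEY / ANTHROPIC_API_KEY from the environment by default.
import os
import subprocess

assert "OPENAI_API_KEY" in os.environ, "export OPENAI_API_KEY before running"
# ANTHROPIC_API_KEY is only needed when evaluating Claude models.

model_name = "gpt-4o-mini"  # placeholder; use any model the repo supports
num_examples = "10"         # small sample for a quick smoke test

subprocess.run(
    ["python", "-m", "simple_evals.simple_evals",
     "--model", model_name,
     "--examples", num_examples],
    check=True,  # raise if the eval run exits with an error
)
```

Starting with a small --examples value is a cheap way to confirm that keys and model adapters are wired up before launching a full benchmark run.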

Highlighted Details

  • Includes benchmarks for MMLU, GPQA, MATH, HumanEval, DROP, MGSM, and SimpleQA.
  • Supports evaluation of OpenAI models (o1, o3, o4-mini, o3-mini, GPT-4 variants) and other models like Claude and Llama.
  • Focuses on zero-shot, chain-of-thought prompting for evaluation.
  • Not intended as a replacement for the more comprehensive openai/evals repository.

Maintenance & Community

The repository states it will not be actively maintained; PR acceptance is limited to bug fixes, new model adapters, and new eval results.

Licensing & Compatibility

  • Evals and data contributed are under the MIT License.
  • OpenAI reserves the right to use contributed data for service improvements.
  • Contributions are subject to OpenAI's Usage Policies.

Limitations & Caveats

This repository is not actively maintained and will not accept new evals. Some benchmarks (MGSM, DROP) may be saturated for newer models. For newer models, MATH results use an updated, IID version of the benchmark (MATH-500).

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 92 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

RULER by NVIDIA

  • Top 0.8% on SourcePulse, 1k stars
  • Evaluation suite for long-context language models (research paper)
  • Created 1 year ago, updated 1 month ago