simple-evals by openai

Lightweight library for evaluating language models

Created 2 years ago
4,502 stars

Top 10.9% on SourcePulse

Project Summary

simple-evals is a lightweight library for evaluating language models with zero-shot, chain-of-thought prompts, which are meant to reflect realistic usage. It targets researchers and developers who need transparent, reproducible accuracy benchmarks, and it ships a curated set of standard evaluations.

How It Works

The library emphasizes simple, direct instructions for evaluations, avoiding complex few-shot or role-playing prompts that can skew results for instruction-tuned models. This approach aims to better reflect real-world usage and model capabilities in a zero-shot setting.
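As a rough illustration of the zero-shot, chain-of-thought style described above, the sketch below builds a single direct instruction for a multiple-choice question and pulls the final answer letter out of a model response. The template wording, the regular expression, and the function names (`build_prompt`, `extract_answer`, `score`) are hypothetical and are not taken from the repository.

```python
import re

# Hypothetical template, for illustration only (not copied from simple-evals).
# It gives one direct instruction: no few-shot examples, no role-play framing.
ZERO_SHOT_COT_TEMPLATE = """Answer the following multiple choice question. Think step by step, then finish with a line of the form 'Answer: LETTER', where LETTER is one of ABCD.

{question}

A) {A}
B) {B}
C) {C}
D) {D}"""

# Matches "Answer: X" (case-insensitive), where X is a single letter A-D.
ANSWER_PATTERN = re.compile(r"(?i)answer\s*:\s*([ABCD])\b")


def build_prompt(question, choices):
    """Fill the template; `choices` maps the letters "A".."D" to option text."""
    return ZERO_SHOT_COT_TEMPLATE.format(question=question, **choices)


def extract_answer(response_text):
    """Return the last answer letter found in the response, or None."""
    matches = ANSWER_PATTERN.findall(response_text)
    return matches[-1].upper() if matches else None


def score(response_text, correct_letter):
    """Score 1.0 for an exact letter match, 0.0 otherwise."""
    return 1.0 if extract_answer(response_text) == correct_letter else 0.0
```

Taking the last match rather than the first lets the model reason freely before committing to a final answer line.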

Quick Start & Requirements

  • Install: pip install openai (for the OpenAI API) and pip install anthropic (for the Anthropic API).
  • Prerequisites: API keys for the providers whose models you evaluate (OpenAI, Anthropic).
  • Running evals: python -m simple_evals.simple_evals --model <model_name> --examples <num_examples>
  • Additional setup for HumanEval: git clone https://github.com/openai/human-eval and pip install -e human-eval.
  • More details: see the OpenAI API docs and the Anthropic API docs.
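Before running an eval, it can help to confirm that the client packages and API keys listed above are in place. The following is a minimal preflight sketch, not part of simple-evals: `check_setup` is a hypothetical helper, and it assumes the keys are supplied through `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, the environment variables the official `openai` and `anthropic` Python SDKs read by default.

```python
import importlib.util
import os

# Assumption: keys are provided via the default environment variables
# used by the official openai and anthropic Python SDKs.
PROVIDERS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def check_setup(env=None):
    """Report, per provider, whether its package is importable and its key is set."""
    env = os.environ if env is None else env
    report = {}
    for package, key_name in PROVIDERS.items():
        report[package] = {
            # find_spec returns None when a top-level package is not installed.
            "package_installed": importlib.util.find_spec(package) is not None,
            # An empty or missing variable counts as "not set".
            "api_key_set": bool(env.get(key_name)),
        }
    return report


if __name__ == "__main__":
    for provider, status in check_setup().items():
        print(provider, status)
```

Only the providers you plan to evaluate need to pass both checks.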

Highlighted Details

  • Includes benchmarks for MMLU, GPQA, MATH, HumanEval, DROP, MGSM, and SimpleQA.
  • Supports evaluation of OpenAI models (o1, o3, o4-mini, o3-mini, GPT-4 variants) and other models like Claude and Llama.
  • Focuses on zero-shot, chain-of-thought prompting for evaluation.
  • Not intended as a replacement for the more comprehensive openai/evals repository.

Maintenance & Community

The repository states it will not be actively maintained, with limited acceptance of PRs for bug fixes, adding model adapters, or new eval results.

Licensing & Compatibility

  • Contributed evals and data are licensed under the MIT License.
  • OpenAI reserves the right to use contributed data for service improvements.
  • Contributions are subject to OpenAI's Usage Policies.

Limitations & Caveats

This repository is not actively maintained and will not accept new evals. Some benchmarks (MGSM, DROP) may be saturated for newer models. Results for newer models on MATH are reported on MATH-500, a newer IID version of the benchmark.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 1
  • Star history: 47 stars in the last 30 days
