frontier-evals by openai

Code repo for OpenAI preparedness evals

Created 9 months ago
974 stars

Top 37.9% on SourcePulse

View on GitHub
Project Summary

This repository provides code for multiple "Preparedness" evaluations, leveraging the nanoeval and alcatraz frameworks. It is intended for researchers and developers working on evaluating AI system preparedness, offering a structured approach to benchmarking.

How It Works

The project uses the nanoeval and alcatraz libraries to define and execute evaluation benchmarks. This modular approach supports building diverse evaluation suites, such as PaperBench and the forthcoming SWELancer and MLE-bench, enabling systematic assessment of AI capabilities.

Quick Start & Requirements

  • Primary install: pip install -e project/<project_name> for nanoeval, alcatraz, and nanoeval_alcatraz (see the example command sequence after this list).
  • Prerequisites: Python 3.11 (3.12 untested, 3.13 will break chz).
  • Setup: Requires running a bash script to install project dependencies.
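
A minimal install sequence might look like the following sketch. The clone URL is assumed from the project and owner names; the per-project installs follow the pip install -e project/<project_name> pattern above, and the bash dependency-setup script mentioned in the Setup bullet still needs to be run.

    # Requires Python 3.11 (see Prerequisites above)
    # Clone the repository; the URL is assumed from the project and owner names
    git clone https://github.com/openai/frontier-evals.git
    cd frontier-evals

    # Install each project in editable mode
    pip install -e project/nanoeval
    pip install -e project/alcatraz
    pip install -e project/nanoeval_alcatraz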

Highlighted Details

  • Provides the nanoeval and alcatraz evaluation frameworks, plus the nanoeval_alcatraz integration.
  • Includes PaperBench evaluation.
  • Forthcoming evaluations: SWELancer and MLE-bench.

Maintenance & Community

No specific community channels or maintenance details are provided in the README.

Licensing & Compatibility

The repository does not specify a license.

Limitations & Caveats

Python 3.12 is untested, and Python 3.13 is known to break the chz dependency. The repository is focused on specific evaluation frameworks and does not offer broader AI development tools.

Health Check
  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 16 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

evalplus by evalplus

0.2%
2k stars
LLM code evaluation framework for rigorous testing
Created 2 years ago
Updated 3 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

0.1%
3k
LLM evaluation framework
Created 2 years ago
Updated 3 months ago
Starred by Morgan Funtowicz Morgan Funtowicz(Head of ML Optimizations at Hugging Face), Luis Capelo Luis Capelo(Cofounder of Lightning AI), and
8 more.

lighteval by huggingface

0.5%
2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 3 days ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench

0.8%
4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago
Updated 6 days ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 35 more.

evals by openai

0.1%
18k stars
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 3 years ago
Updated 2 months ago