preparedness by openai

Code repo for OpenAI preparedness evals

Created 5 months ago
861 stars

Top 41.6% on SourcePulse

View on GitHub
Project Summary

This repository provides code for multiple "Preparedness" evaluations, leveraging the nanoeval and alcatraz frameworks. It is intended for researchers and developers working on evaluating AI system preparedness, offering a structured approach to benchmarking.

How It Works

The project utilizes the nanoeval and alcatraz libraries to define and execute evaluation benchmarks. This modular approach supports diverse evaluation suites, such as PaperBench today and the forthcoming SWELancer and MLE-bench, facilitating systematic assessment of AI capabilities.
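The snippet below is a hedged illustration of that modular shape in plain Python: a benchmark bundled as tasks plus a grading function, run against any model exposed as a callable. The Task, Benchmark, and run_benchmark names are hypothetical and are not the nanoeval or alcatraz APIs, which this summary does not document.

```python
from dataclasses import dataclass
from typing import Callable

# Generic illustration of a modular eval suite (NOT the actual nanoeval/alcatraz API):
# a benchmark is a named set of tasks plus a grader, and any model exposed as a
# callable can be scored against it.

@dataclass
class Task:
    prompt: str
    reference: str

@dataclass
class Benchmark:
    name: str
    tasks: list[Task]
    grade: Callable[[str, str], float]  # (model_answer, reference) -> score in [0, 1]

def run_benchmark(benchmark: Benchmark, model: Callable[[str], str]) -> float:
    """Run every task through the model and return the mean score."""
    scores = [benchmark.grade(model(t.prompt), t.reference) for t in benchmark.tasks]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    toy = Benchmark(
        name="toy-exact-match",
        tasks=[Task(prompt="2 + 2 = ?", reference="4")],
        grade=lambda answer, reference: float(answer.strip() == reference),
    )
    print(run_benchmark(toy, model=lambda prompt: "4"))  # prints 1.0
```

Swapping in a different benchmark (PaperBench-style grading, for example) only changes the task set and grader, which is the kind of modularity the paragraph above describes.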

Quick Start & Requirements

  • Primary install: pip install -e project/<project_name> for nanoeval, alcatraz, and nanoeval_alcatraz (a scripted sketch follows this list).
  • Prerequisites: Python 3.11 (3.12 is untested; 3.13 breaks the chz dependency).
  • Setup: requires running a bash script to install project dependencies.
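
A minimal sketch of the editable installs named above, assuming the project/<project_name> layout from the install command and a Python 3.11 interpreter; looping over the three packages with subprocess is just one way to script it, not the repository's own setup script.

```python
import subprocess
import sys

# Editable installs for the three packages listed above.
# Assumes this runs from the repository root, where the
# project/<project_name> directories live.
PACKAGES = ["nanoeval", "alcatraz", "nanoeval_alcatraz"]

for name in PACKAGES:
    # Equivalent to: pip install -e project/<name>
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", f"project/{name}"],
        check=True,
    )
```

The bash setup script mentioned in the list still needs to be run separately for per-project dependencies.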

Highlighted Details

  • Built on the nanoeval and alcatraz evaluation frameworks.
  • Includes PaperBench evaluation.
  • Forthcoming evaluations: SWELancer and MLE-bench.

Maintenance & Community

No specific community channels or maintenance details are provided in the README.

Licensing & Compatibility

The repository does not specify a license.

Limitations & Caveats

Python 3.12 is untested, and Python 3.13 is known to break the chz dependency. The repository is focused on specific evaluation frameworks and does not offer broader AI development tools.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 6
  • Star History: 26 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Binyuan Hui (Research Scientist at Alibaba Qwen), and 2 more.

evalplus by evalplus

0.3%
2k
LLM code evaluation framework for rigorous testing
Created 2 years ago
Updated 4 weeks ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

0.1%
3k
LLM evaluation framework
Created 2 years ago
Updated 1 month ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2.6%
2k
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 1 day ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3%
4k
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago
Updated 18 hours ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 34 more.

evals by openai

0.2%
17k
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 2 years ago
Updated 9 months ago