frontier-evals by openai

Code repo for OpenAI preparedness evals

Created 9 months ago
974 stars

Top 37.9% on SourcePulse

View on GitHub
Project Summary

This repository provides code for multiple "Preparedness" evaluations, leveraging the nanoeval and alcatraz frameworks. It is intended for researchers and developers working on evaluating AI system preparedness, offering a structured approach to benchmarking.

How It Works

The project uses the nanoeval and alcatraz libraries to define and execute evaluation benchmarks. This modular approach supports building diverse evaluation suites, such as PaperBench and the forthcoming SWELancer and MLE-bench, enabling systematic assessment of AI capabilities.

Quick Start & Requirements

  • Primary install: pip install -e project/<project_name> for nanoeval, alcatraz, and nanoeval_alcatraz (see the example command sequence after this list).
  • Prerequisites: Python 3.11 (3.12 untested, 3.13 will break chz).
  • Setup: Requires running a bash script to install project dependencies.
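
A minimal install sequence might look like the following sketch. The clone URL is assumed from the project and owner names; the per-project installs follow the pip install -e project/<project_name> pattern above, and the bash dependency-setup script mentioned in the Setup bullet still needs to be run.

    # Requires Python 3.11 (see Prerequisites above)
    # Clone the repository; the URL is assumed from the project and owner names
    git clone https://github.com/openai/frontier-evals.git
    cd frontier-evals

    # Install each project in editable mode
    pip install -e project/nanoeval
    pip install -e project/alcatraz
    pip install -e project/nanoeval_alcatraz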

Highlighted Details

  • Provides the nanoeval and alcatraz evaluation frameworks, plus the nanoeval_alcatraz integration.
  • Includes PaperBench evaluation.
  • Forthcoming evaluations: SWELancer and MLE-bench.

Maintenance & Community

No specific community channels or maintenance details are provided in the README.

Licensing & Compatibility

The repository does not specify a license.

Limitations & Caveats

Python 3.12 is untested, and Python 3.13 is known to break the chz dependency. The repository is focused on specific evaluation frameworks and does not offer broader AI development tools.

Health Check
  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 16 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

evalplus by evalplus

0.2%
2k stars
LLM code evaluation framework for rigorous testing
Created 2 years ago
Updated 3 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

0.1%
3k
LLM evaluation framework
Created 2 years ago
Updated 3 months ago
Starred by Morgan Funtowicz Morgan Funtowicz(Head of ML Optimizations at Hugging Face), Luis Capelo Luis Capelo(Cofounder of Lightning AI), and
8 more.

lighteval by huggingface

0.5%
2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 3 days ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Pawel Garbacki (Cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench

0.8%
4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago
Updated 6 days ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 35 more.

evals by openai

0.1%
18k stars
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 3 years ago
Updated 2 months ago