yet-another-applied-llm-benchmark by carlini

LLM benchmark for evaluating models on previously asked programming questions

Created 1 year ago
1,031 stars

Top 36.4% on SourcePulse

Project Summary

This benchmark evaluates language models on real-world programming tasks the author has encountered. It's designed for developers and researchers who need to assess LLM capabilities beyond standard academic metrics, focusing on practical problem-solving and code generation.

How It Works

The benchmark uses a custom dataflow domain-specific language (DSL) to chain operations: prompt an LLM, execute the generated code inside a Podman or Docker container, and evaluate the output. This approach allows for complex, multi-step evaluations, such as using another LLM to judge code output or comparing generated images against reference solutions. The DSL is designed to make adding new, realistic tests easy.
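
As a sketch of that pipeline style, the test below chains a prompt, code extraction, sandboxed execution, and an output check. The node names (LLMRun, ExtractCode, PythonRun, SubstringEvaluator) follow the example given in the project's README, but treat the exact API and import path as illustrative:

    # Illustrative test case in the chained-operator DSL style described above.
    # Node names follow the README's example; the import path is an assumption.
    from evaluator import LLMRun, ExtractCode, PythonRun, SubstringEvaluator

    TestPrintHello = (
        'Write a "hello world" program in python'
        >> LLMRun()                           # send the prompt to the model under test
        >> ExtractCode()                      # pull the code block out of the reply
        >> PythonRun()                        # execute it inside the container sandbox
        >> SubstringEvaluator("hello world")  # pass if the captured output contains the string
    )

Each stage consumes the previous stage's output, so swapping in a different runner or evaluator node changes what is tested without rewriting the rest of the pipeline.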

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt and optionally pip install -r requirements-extra.txt.
  • Containerization: Podman (preferred) or Docker is required for secure code execution.
  • API Keys: Configure LLM API keys in config.json. An OpenAI API key is also required, as some tests use an OpenAI model to judge outputs.
  • Chrome: Required for specific test cases involving HTML/JavaScript generation.
  • Run benchmark: python main.py --model <model_name> --run-tests --generate-report.
  • Additional resources: run_a_simple_testcase.ipynb walks through running a single test case (see the sketch after this list).
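
The notebook above walks through a single test case; a hypothetical sketch of the same flow as a plain script is shown below. The evaluator module, run_test helper, and test module path are assumptions about the project layout, not its documented API:

    # Hypothetical sketch: evaluate one DSL test case directly, without main.py.
    # The import paths and the run_test() helper are assumed and may differ
    # from what run_a_simple_testcase.ipynb actually uses.
    from evaluator import run_test
    from tests.print_hello import TestPrintHello  # a chained DSL test, as sketched earlier

    if __name__ == "__main__":
        # Runs the chained pipeline against the configured model and prints
        # whether this single case passed.
        print(run_test(TestPrintHello))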

Highlighted Details

  • Evaluates LLMs on tasks like Python to C conversion, bytecode decompilation, minified JavaScript explanation, and SQL generation.
  • Features a DSL for creating complex, verifiable test pipelines (a sketch of one such pipeline follows this list).
  • Includes benchmark results for several leading LLMs (e.g., Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro).
  • Emphasizes practical utility over academic rigor, with tests derived from actual user prompts.
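
To give a flavor of how one of the listed tasks could be expressed, the sketch below frames a Python-to-C conversion check as a pipeline: the model's C translation is compiled, run in the sandbox, and its output compared against the known answer. The CRun node and import path are assumptions made for illustration:

    # Illustrative only: a Python-to-C conversion test in the chained DSL style.
    # Node names and the import path are assumed, not taken from the project's tests.
    from evaluator import LLMRun, ExtractCode, CRun, SubstringEvaluator

    TestPythonToC = (
        "Convert this Python program to a C program with identical output:\n"
        "print(sum(range(10)))"
        >> LLMRun()                   # ask the model for a C translation
        >> ExtractCode()              # extract the C source from the reply
        >> CRun()                     # compile and run it inside the sandbox
        >> SubstringEvaluator("45")   # a faithful translation prints 45
    )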

Maintenance & Community

  • Primarily maintained by Nicholas Carlini.
  • Open to contributions via Pull Requests for new tests.

Licensing & Compatibility

  • Licensed under the GNU General Public License v3 or later.
  • Its copyleft terms require that distributed derivative works be released under the same license, which can complicate commercial or closed-source integration.

Limitations & Caveats

The benchmark is explicitly not intended for rigorous academic comparison or determining which model is "better" overall, as prompts are not optimized and test cases may be ambiguous or rely on recent knowledge. Failing a test provides limited insight, whereas passing demonstrates specific, verifiable capabilities.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Binyuan Hui (Research Scientist at Alibaba Qwen), and 2 more.

evalplus by evalplus

0.3% · 2k stars
LLM code evaluation framework for rigorous testing
Created 2 years ago · Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

0.4% · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago · Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 20 hours ago