yet-another-applied-llm-benchmark by carlini

LLM benchmark for evaluating models on previously asked programming questions

created 1 year ago
1,023 stars

Top 37.2% on sourcepulse

View on GitHub
Project Summary

This benchmark evaluates language models on real-world programming tasks the author has encountered. It's designed for developers and researchers who need to assess LLM capabilities beyond standard academic metrics, focusing on practical problem-solving and code generation.

How It Works

The benchmark utilizes a custom dataflow Domain Specific Language (DSL) to chain operations: prompt an LLM, execute the generated code (within a Docker container), and evaluate the output. This approach allows for complex, multi-step evaluations, including using another LLM to judge code output or comparing generated images against reference solutions. The DSL is designed for ease of adding new, realistic tests.
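
To make the dataflow idea concrete, below is a small, self-contained toy version of this chaining pattern: seed a prompt, pretend to query a model, execute the returned code, and check the output. The class names and the ">>" wiring are written from scratch for this summary and are not the benchmark's actual API; in the real harness the model call hits a configured LLM and the execution step runs inside a Podman/Docker container rather than a plain subprocess.

    # Toy illustration of the ">>"-style dataflow pattern described above.
    # These classes are NOT the benchmark's own API; they only show how a
    # prompt -> run code -> check output pipeline can be chained.

    import subprocess
    import sys


    class Node:
        """Base stage: chaining with >> builds a pipeline."""

        def __rshift__(self, other):
            return Pipeline([self, other])

        def run(self, value):
            raise NotImplementedError


    class Pipeline(Node):
        def __init__(self, stages):
            self.stages = stages

        def __rshift__(self, other):
            return Pipeline(self.stages + [other])

        def run(self, value=None):
            for stage in self.stages:
                value = stage.run(value)
            return value


    class Prompt(Node):
        """Seed the pipeline with a fixed prompt string."""

        def __init__(self, text):
            self.text = text

        def run(self, _):
            return self.text


    class FakeLLM(Node):
        """Stand-in for an LLM call; a real stage would query an API here."""

        def run(self, prompt):
            return 'print("hello world")'  # pretend this is the model's answer


    class PythonRun(Node):
        """Execute the generated code in a subprocess and capture stdout.
        (The real benchmark runs this step inside a Podman/Docker container.)"""

        def run(self, code):
            result = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=30,
            )
            return result.stdout


    class SubstringEvaluator(Node):
        """Pass if the expected substring appears in the program output."""

        def __init__(self, expected):
            self.expected = expected

        def run(self, output):
            return self.expected in output


    test = (
        Prompt('Write a "hello world" program in Python')
        >> FakeLLM()
        >> PythonRun()
        >> SubstringEvaluator("hello world")
    )
    print(test.run())  # True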

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt and optionally pip install -r requirements-extra.txt.
  • Containerization: Podman (preferred) or Docker is required for secure code execution.
  • API Keys: Configure LLM API keys in config.json. OpenAI API keys are required for secondary evaluations (an illustrative config sketch follows this list).
  • Chrome: Required for specific test cases involving HTML/JavaScript generation.
  • Run benchmark: python main.py --model <model_name> --run-tests --generate-report.
  • Additional resources: run_a_simple_testcase.ipynb
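
As a point of reference, the snippet below writes out one possible shape for config.json. The key names ("container", "llms", per-provider "api_key") are assumptions made for illustration only; the authoritative schema is whatever the repository's own example config and README specify.

    import json

    # Hypothetical config layout: container runtime choice plus per-provider
    # API keys. These key names are guesses for illustration, not the
    # benchmark's documented schema.
    config = {
        "container": "podman",
        "llms": {
            "openai": {"api_key": "sk-..."},   # needed for secondary evaluations
            "anthropic": {"api_key": "..."},
        },
    }

    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)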

Highlighted Details

  • Evaluates LLMs on tasks like Python to C conversion, bytecode decompilation, minified JavaScript explanation, and SQL generation.
  • Features a DSL for creating complex, verifiable test pipelines (a judge-style sketch follows this list).
  • Includes benchmark results for several leading LLMs (e.g., Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro).
  • Emphasizes practical utility over academic rigor, with tests derived from actual user prompts.
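
Continuing the toy sketch from "How It Works", a multi-step, verifiable pipeline that uses a second model as a judge could be wired up as follows. The stage names here are invented for illustration and are not the benchmark's real classes.

    # Reuses the toy Node/Prompt/FakeLLM classes sketched under "How It Works".

    class FakeJudge(Node):
        """Stand-in for a second LLM call that grades free-form output
        against a question; a real stage would query an evaluation model."""

        def __init__(self, question):
            self.question = question

        def run(self, output):
            # A real judge would send self.question plus the output to an LLM
            # and return its verdict; here we fake an affirmative answer.
            return "yes"


    class YesNoEvaluator(Node):
        """Pass if the judge's verdict starts with 'yes'."""

        def run(self, verdict):
            return verdict.strip().lower().startswith("yes")


    judged = (
        Prompt("Explain what this minified JavaScript snippet does: ...")
        >> FakeLLM()
        >> FakeJudge("Is the explanation accurate and complete?")
        >> YesNoEvaluator()
    )
    print(judged.run())  # True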

Maintenance & Community

  • Primarily maintained by Nicholas Carlini.
  • Open to contributions via Pull Requests for new tests.

Licensing & Compatibility

  • Licensed under the GNU General Public License v3 or later.
  • Its copyleft terms require derivative works to be distributed under the same license, which can complicate commercial or closed-source integration.

Limitations & Caveats

The benchmark is explicitly not intended for rigorous academic comparison or determining which model is "better" overall, as prompts are not optimized and test cases may be ambiguous or rely on recent knowledge. Failing a test provides limited insight, whereas passing demonstrates specific, verifiable capabilities.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 18 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Wei-Lin Chiang (cofounder of LMArena).

evalplus by evalplus
LLM code evaluation framework for rigorous testing
2k stars, top 0.5%, created 2 years ago, updated 4 weeks ago