yet-another-applied-llm-benchmark by carlini

LLM benchmark for evaluating models on previously asked programming questions

Created 1 year ago
1,031 stars

Top 36.4% on SourcePulse

Project Summary

This benchmark evaluates language models on real-world programming tasks the author has encountered. It's designed for developers and researchers who need to assess LLM capabilities beyond standard academic metrics, focusing on practical problem-solving and code generation.

How It Works

The benchmark uses a custom dataflow domain-specific language (DSL) to chain operations: prompt an LLM, execute the generated code inside a Podman or Docker container, and evaluate the output. This approach allows for complex, multi-step evaluations, such as using another LLM to judge code output or comparing generated images against reference solutions. The DSL is designed to make adding new, realistic tests easy.
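
As a sketch of that pipeline style, the test below chains a prompt, code extraction, sandboxed execution, and an output check. The node names (LLMRun, ExtractCode, PythonRun, SubstringEvaluator) follow the example given in the project's README, but treat the exact API and import path as illustrative:

    # Illustrative test case in the chained-operator DSL style described above.
    # Node names follow the README's example; the import path is an assumption.
    from evaluator import LLMRun, ExtractCode, PythonRun, SubstringEvaluator

    TestPrintHello = (
        'Write a "hello world" program in python'
        >> LLMRun()                           # send the prompt to the model under test
        >> ExtractCode()                      # pull the code block out of the reply
        >> PythonRun()                        # execute it inside the container sandbox
        >> SubstringEvaluator("hello world")  # pass if the captured output contains the string
    )

Each stage consumes the previous stage's output, so swapping in a different runner or evaluator node changes what is tested without rewriting the rest of the pipeline.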

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt and optionally pip install -r requirements-extra.txt.
  • Containerization: Podman (preferred) or Docker is required for secure code execution.
  • API Keys: Configure LLM API keys in config.json. An OpenAI API key is also required, as some tests use an OpenAI model to judge outputs.
  • Chrome: Required for specific test cases involving HTML/JavaScript generation.
  • Run benchmark: python main.py --model <model_name> --run-tests --generate-report.
  • Additional resources: run_a_simple_testcase.ipynb walks through running a single test case (see the sketch after this list).
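
The notebook above walks through a single test case; a hypothetical sketch of the same flow as a plain script is shown below. The evaluator module, run_test helper, and test module path are assumptions about the project layout, not its documented API:

    # Hypothetical sketch: evaluate one DSL test case directly, without main.py.
    # The import paths and the run_test() helper are assumed and may differ
    # from what run_a_simple_testcase.ipynb actually uses.
    from evaluator import run_test
    from tests.print_hello import TestPrintHello  # a chained DSL test, as sketched earlier

    if __name__ == "__main__":
        # Runs the chained pipeline against the configured model and prints
        # whether this single case passed.
        print(run_test(TestPrintHello))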

Highlighted Details

  • Evaluates LLMs on tasks like Python to C conversion, bytecode decompilation, minified JavaScript explanation, and SQL generation.
  • Features a DSL for creating complex, verifiable test pipelines (a sketch of one such pipeline follows this list).
  • Includes benchmark results for several leading LLMs (e.g., Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro).
  • Emphasizes practical utility over academic rigor, with tests derived from actual user prompts.
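
To give a flavor of how one of the listed tasks could be expressed, the sketch below frames a Python-to-C conversion check as a pipeline: the model's C translation is compiled, run in the sandbox, and its output compared against the known answer. The CRun node and import path are assumptions made for illustration:

    # Illustrative only: a Python-to-C conversion test in the chained DSL style.
    # Node names and the import path are assumed, not taken from the project's tests.
    from evaluator import LLMRun, ExtractCode, CRun, SubstringEvaluator

    TestPythonToC = (
        "Convert this Python program to a C program with identical output:\n"
        "print(sum(range(10)))"
        >> LLMRun()                   # ask the model for a C translation
        >> ExtractCode()              # extract the C source from the reply
        >> CRun()                     # compile and run it inside the sandbox
        >> SubstringEvaluator("45")   # a faithful translation prints 45
    )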

Maintenance & Community

  • Primarily maintained by Nicholas Carlini.
  • Open to contributions via Pull Requests for new tests.

Licensing & Compatibility

  • Licensed under the GNU General Public License v3 or later.
  • Its copyleft terms require that distributed derivative works be released under the same license, which can complicate commercial or closed-source integration.

Limitations & Caveats

The benchmark is explicitly not intended for rigorous academic comparison or determining which model is "better" overall, as prompts are not optimized and test cases may be ambiguous or rely on recent knowledge. Failing a test provides limited insight, whereas passing demonstrates specific, verifiable capabilities.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Binyuan Hui (Research Scientist at Alibaba Qwen), and 2 more.

evalplus by evalplus

0.3% · 2k stars
LLM code evaluation framework for rigorous testing
Created 2 years ago · Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

0.4% · 3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago · Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3% · 4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago · Updated 20 hours ago