bigcodebench by bigcode-project

Code benchmark for evaluating LLMs on practical software engineering tasks

Created 1 year ago
422 stars

Top 69.8% on SourcePulse

View on GitHub
Project Summary

BigCodeBench is a benchmark suite for evaluating the code generation capabilities of Large Language Models (LLMs) on practical, challenging software engineering tasks. Aimed at researchers and developers working on AI for software engineering, it provides a standardized way to assess how well LLMs follow complex instructions and use diverse function calls, moving beyond simpler HumanEval-style tasks.

How It Works

BigCodeBench offers two evaluation splits: "Complete", where models finish code from comprehensive docstrings, and "Instruct", where instruction-tuned models must reason from natural-language prompts. Generation supports multiple inference backends (e.g., vLLM, OpenAI, Hugging Face), and the generated code is evaluated through a remote execution API with execution environments such as E2B and Gradio. This setup enables reproducible, scalable benchmarking, and pre-generated samples are available to accelerate research.
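
To make the two splits concrete, the sketch below loads the benchmark from the Hugging Face Hub and prints both prompt styles for a single task. It is a minimal sketch, assuming the dataset is published as "bigcode/bigcodebench" with a versioned split and columns named task_id, complete_prompt, and instruct_prompt; the actual identifiers may differ.

```python
# Minimal sketch: compare the "Complete" (docstring) and "Instruct"
# (natural-language) prompts for one benchmark task.
# Assumptions: dataset ID "bigcode/bigcodebench", a versioned split name,
# and columns "task_id", "complete_prompt", "instruct_prompt".
from datasets import load_dataset

ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")  # split name is an assumption

task = ds[0]
print("Task:", task["task_id"])
print("\n--- Complete prompt (code completion from a docstring) ---")
print(task["complete_prompt"])
print("\n--- Instruct prompt (natural-language instruction) ---")
print(task["instruct_prompt"])
```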

Quick Start & Requirements

Highlighted Details

  • Evaluated 163 models as of January 2025.
  • Includes "BigCodeBench-Hard" with 148 tasks aligned with real-world programming.
  • Offers a public code execution API on Hugging Face Spaces.
  • Provides pre-generated LLM samples for various models to expedite research.

Maintenance & Community

The project is actively maintained with frequent releases and has been adopted by numerous major LLM teams. A Hugging Face leaderboard is available for tracking model performance.

Licensing & Compatibility

The project appears to be open-source, but the specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license terms.

Limitations & Caveats

The README notes that batch inference results can vary and recommends setting batch size to 1 for more deterministic greedy decoding. Remote evaluation backends like Gradio and E2B can be slow on default machines. Base models with tokenizer.chat_template might require the --direct_completion flag to avoid chat mode evaluation.
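
As a rough illustration of these recommendations, the sketch below drives a greedy, batch-size-1 generation run for a base model from Python. Only --direct_completion is taken from the README; the entry-point name (bigcodebench.generate) and the remaining flags (--model, --split, --greedy, --bs) are assumptions about the CLI and should be checked against its actual help output.

```python
# Hedged sketch of a deterministic greedy run, per the caveats above.
# Only --direct_completion is named in the README; the entry point and
# the other flag names are assumptions -- verify with `--help` first.
import subprocess

cmd = [
    "bigcodebench.generate",              # assumed console entry point
    "--model", "bigcode/starcoder2-15b",  # hypothetical base model
    "--split", "complete",
    "--greedy",                           # assumed flag for greedy decoding
    "--bs", "1",                          # batch size 1 for more deterministic results
    "--direct_completion",                # skip chat-mode evaluation for base models
]
subprocess.run(cmd, check=True)
```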

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 3
Issues (30d): 1
Star History: 10 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Binyuan Hui (Research Scientist at Alibaba Qwen), and 2 more.

Explore Similar Projects

evalplus by evalplus

0.3%
2k stars
LLM code evaluation framework for rigorous testing
Created 2 years ago
Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

0.4%
3k stars
Evaluation harness for LLMs trained on code
Created 4 years ago
Updated 8 months ago
Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

SWE-bench by SWE-bench

2.3%
4k stars
Benchmark for evaluating LLMs on real-world GitHub issues
Created 1 year ago
Updated 20 hours ago