bigcodebench by bigcode-project

Code benchmark for evaluating LLMs on practical software engineering tasks

created 1 year ago
402 stars

Top 73.2% on sourcepulse

Project Summary

BigCodeBench is a benchmark suite for evaluating the code generation capabilities of Large Language Models (LLMs) on practical, challenging software engineering tasks. It targets researchers and developers working on AI for software engineering and provides a standardized way to assess how well LLMs handle complex instructions and diverse function calls, moving beyond simpler HumanEval-style tasks.

How It Works

BigCodeBench offers two evaluation splits: "Complete", where models finish a function from a comprehensive docstring, and "Instruct", where instruction-tuned models must reason from a natural-language prompt alone. Evaluation is run through a remote API with support for several inference backends (e.g., vLLM, OpenAI, Hugging Face) and execution environments (e.g., E2B, Gradio). This setup enables reproducible, scalable benchmarking, and pre-generated samples are available to accelerate research.
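
As a rough illustration of that flow (not the project's actual API), the sketch below builds one request in each prompt style, calls a placeholder model, and writes the samples to a JSONL file of the kind an execution backend would then score; the helper name, example task, and record shape are all assumptions.

```python
# Hedged sketch of the generate-then-evaluate flow described above.
# query_model, the example task, and the JSONL record shape are
# illustrative assumptions, not BigCodeBench's actual interfaces.
import json


def query_model(prompt: str) -> str:
    """Placeholder for an LLM call (vLLM, OpenAI, Hugging Face, ...)."""
    return "def task_func():\n    pass\n"


# The same underlying task, phrased for each split:
#   "Complete"  -> finish a function from its docstring-bearing code prompt
#   "Instruct"  -> produce the function from a natural-language instruction
example_task = {
    "task_id": "BigCodeBench/0",
    "complete_prompt": 'def task_func():\n    """Return ..."""\n',
    "instruct_prompt": "Write a function task_func() that returns ...",
}

samples = []
for prompt_field in ("complete_prompt", "instruct_prompt"):
    solution = query_model(example_task[prompt_field])
    samples.append({"task_id": example_task["task_id"], "solution": solution})

# The generated samples are then handed to an execution backend
# (local, E2B, Gradio, or the hosted API) for pass/fail scoring.
with open("samples.jsonl", "w") as fh:
    for record in samples:
        fh.write(json.dumps(record) + "\n")
```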

Quick Start & Requirements
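
This summary omits the upstream quick start; the project's README distributes BigCodeBench on PyPI (pip install bigcodebench) and runs generation and evaluation from command-line tools. Purely as an illustration, the sketch below pulls the benchmark data straight from the Hugging Face Hub; the dataset ID, split tags, and field names are assumptions, so defer to the README for the authoritative steps.

```python
# Minimal sketch of loading the benchmark data, assuming the public
# "bigcode/bigcodebench" dataset on the Hugging Face Hub (an assumption,
# as are the split tags and field names below).
from datasets import load_dataset  # pip install datasets

dataset_dict = load_dataset("bigcode/bigcodebench")
split_name = sorted(dataset_dict.keys())[-1]  # splits appear to be version tags
tasks = dataset_dict[split_name]

task = tasks[0]
print(task["task_id"])
print(task["complete_prompt"][:200])  # docstring-style prompt ("Complete" split)
print(task["instruct_prompt"][:200])  # natural-language prompt ("Instruct" split)
```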

Highlighted Details

  • Evaluated 163 models as of January 2025.
  • Includes "BigCodeBench-Hard" with 148 tasks aligned with real-world programming.
  • Offers a public code execution API on Hugging Face Spaces.
  • Provides pre-generated LLM samples for various models to expedite research.

Maintenance & Community

The project is actively maintained with frequent releases and has been adopted by numerous major LLM teams. A Hugging Face leaderboard is available for tracking model performance.

Licensing & Compatibility

The project appears to be open-source, but the specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license terms.

Limitations & Caveats

The README notes that batch inference results can vary and recommends a batch size of 1 for more deterministic greedy decoding. Remote evaluation backends such as Gradio and E2B can be slow on their default machines. Base models whose tokenizer defines a chat_template may need the --direct_completion flag to avoid being evaluated in chat mode.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 56 stars in the last 90 days
