Code benchmark for evaluating LLMs on practical software engineering tasks
Top 73.2% on sourcepulse
BigCodeBench is a benchmark suite for evaluating the code generation capabilities of Large Language Models (LLMs) on practical, challenging software engineering tasks. It targets researchers and developers working on AI for software engineering, providing a standardized way to assess how well LLMs follow complex instructions and compose diverse function calls, moving beyond simpler HumanEval-style tasks.
How It Works
BigCodeBench offers two evaluation splits: "Complete", where models finish code from comprehensive docstrings, and "Instruct", where instruction-tuned models must reason from natural-language prompts. Evaluation runs through a remote API and supports multiple generation backends (e.g., vLLM, OpenAI, Hugging Face) as well as different execution environments (e.g., E2B, Gradio). This makes benchmarking reproducible and scalable, and pre-generated samples are available to accelerate research.
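As a sketch, a single run might look like the following; the model name is a placeholder, and the flag spellings follow the project's documented CLI but should be checked against your installed version:

```bash
# Generate and evaluate in one pass.
# --split: "complete" (docstring-driven) or "instruct" (natural-language prompts)
# --backend: generation backend, e.g. vllm, openai, or hf
bigcodebench.evaluate \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --split instruct \
  --subset full \
  --backend vllm
```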
Quick Start & Requirements
```bash
pip install bigcodebench --upgrade
```

Optional dependencies: `flash-attn` (recommended for code generation), `packaging`, and `ninja`. API keys are required for the chosen backends (E2B, OpenAI, Anthropic, Mistral, Google, Hugging Face).
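Credentials for remote backends are typically supplied via environment variables; a minimal sketch (the variable names below are each provider's conventional ones, not taken from the README):

```bash
# Export the key for whichever backend you plan to use.
export OPENAI_API_KEY="sk-..."    # OpenAI generation backend
export E2B_API_KEY="e2b_..."      # E2B remote code execution
```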
Maintenance & Community
The project is actively maintained with frequent releases and has been adopted by numerous major LLM teams. A Hugging Face leaderboard is available for tracking model performance.
Licensing & Compatibility
The project appears to be open-source, but the specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license terms.
Limitations & Caveats
The README notes that batch inference results can vary and recommends setting the batch size to 1 for more deterministic greedy decoding. Remote evaluation backends like Gradio and E2B can be slow on default machines. Base models that ship a `tokenizer.chat_template` might require the `--direct_completion` flag to avoid chat-mode evaluation.
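For a more deterministic run with a base model, these caveats translate into flags roughly as follows (a sketch: `--direct_completion` comes from the README, while `--greedy`, `--bs`, and the model name are assumptions that may differ by version):

```bash
# Greedy decoding at batch size 1 for reproducible results;
# --direct_completion keeps a base model out of chat-template evaluation.
bigcodebench.generate \
  --model bigcode/starcoder2-7b \
  --split complete \
  --subset full \
  --bs 1 \
  --greedy \
  --direct_completion
```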