Benchmark for evaluating LLMs on real-world GitHub issues
Top 15.2% on sourcepulse
SWE-bench is a benchmark dataset and evaluation harness designed to assess the ability of large language models (LLMs) to resolve real-world software engineering issues sourced from GitHub. It targets AI researchers and developers building LLM-based code generation and debugging tools, offering a standardized method to measure model performance on complex, practical tasks.
How It Works
The benchmark pairs GitHub repositories with specific issues drawn from their issue trackers. An LLM is tasked with generating a code patch that resolves the described issue. The evaluation harness uses Docker to create reproducible environments in which each patch is applied and the repository's tests are run, giving a consistent, reliable measure of model capability.
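A hedged sketch of that loop from the command line, assuming the documented run_evaluation entry point; the dataset identifier, predictions file name, worker count, and run ID below are illustrative placeholders:

# Predictions are supplied as a JSON/JSONL file whose entries carry
# instance_id, model_name_or_path, and model_patch (a unified diff).
# Flag names and the dataset identifier are assumptions drawn from the
# upstream SWE-bench documentation and may differ between versions.
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path preds.jsonl \
    --max_workers 4 \
    --run_id demo_run

The harness then builds per-instance Docker environments, applies each predicted patch, and reports which instances are resolved.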
Quick Start & Requirements
After cloning the repository, install the package in editable mode with pip install -e .; running the evaluation harness additionally requires a working Docker installation. A minimal setup sketch follows.
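The sequence below is a sketch only; the clone URL and the Docker version check are illustrative assumptions rather than steps taken from this page:

git clone https://github.com/princeton-nlp/SWE-bench.git   # assumed upstream repository URL
cd SWE-bench
pip install -e .        # installs the swebench package and its dependencies
docker --version        # the evaluation harness builds and runs containers via Docker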
Highlighted Details
Cloud-based evaluation is supported via sb-cli (AWS) and Modal.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats