SWE-bench by SWE-bench

Benchmark for evaluating LLMs on real-world GitHub issues

created 1 year ago
3,246 stars

Top 15.2% on sourcepulse

View on GitHub
Project Summary

SWE-bench is a benchmark dataset and evaluation harness designed to assess the ability of large language models (LLMs) to resolve real-world software engineering issues sourced from GitHub. It targets AI researchers and developers building LLM-based code generation and debugging tools, offering a standardized method to measure model performance on complex, practical tasks.

How It Works

The benchmark pairs real GitHub repositories with issues drawn from their trackers; an LLM is given the issue and the codebase and must generate a patch that resolves it. The evaluation harness uses Docker to build a reproducible environment for each instance, applies the candidate patch, and runs the repository's test suite to check that the issue's previously failing tests now pass, giving a consistent, repeatable measure of model capability.
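A minimal sketch of what a run consumes in practice, assuming the predictions format described in the SWE-bench README (newline-delimited JSON with instance_id, model_name_or_path, and model_patch fields); the instance ID and diff below are illustrative placeholders:

    import json

    # Hypothetical model output: one entry per benchmark task. Field names
    # follow the predictions format described in the SWE-bench README; the
    # diff string is a placeholder, not a real fix.
    predictions = [
        {
            "instance_id": "astropy__astropy-12907",  # a task from the test split
            "model_name_or_path": "my-model",         # label identifying this run
            "model_patch": "diff --git a/...",        # unified diff produced by the model
        }
    ]

    # The evaluation harness consumes newline-delimited JSON (JSONL).
    with open("predictions.jsonl", "w") as f:
        for pred in predictions:
            f.write(json.dumps(pred) + "\n")

Evaluation is then launched with the harness's run_evaluation entry point (python -m swebench.harness.run_evaluation), pointing --predictions_path at this file; see the repository README for the full flag list.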

Quick Start & Requirements

  • Install: clone the repository, then run pip install -e . (a quick smoke test follows this list).
  • Prerequisites: Docker is required for reproducible evaluations. Recommended: x86_64 architecture, 120GB free storage, 16GB RAM, 8 CPU cores. Experimental support for arm64.
  • Setup: Follow the Docker setup guide.
  • Docs: Read the Docs
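As a post-install smoke test, the benchmark can be pulled from the Hugging Face Hub with the datasets library. A sketch, assuming the SWE-bench/SWE-bench dataset ID and the field names of the published schema:

    from datasets import load_dataset

    # Load the test split of the main benchmark from the Hugging Face Hub.
    swebench = load_dataset("SWE-bench/SWE-bench", split="test")

    task = swebench[0]
    print(task["instance_id"])              # unique task identifier
    print(task["repo"])                     # GitHub repository the issue comes from
    print(task["problem_statement"][:300])  # the issue text the model must resolve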

Highlighted Details

  • Includes SWE-bench Multimodal for evaluating generalization to visual software domains.
  • SWE-bench Verified offers a subset of 500 problems confirmed solvable by human engineers (see the loading sketch after this list).
  • Supports cloud-based evaluations via sb-cli (AWS) and Modal.
  • Provides pre-processed datasets for training custom models and running inference.
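The Verified subset mentioned above is published as its own dataset. A minimal sketch, assuming the SWE-bench/SWE-bench_Verified ID on the Hugging Face Hub:

    from datasets import load_dataset

    # SWE-bench Verified ships a single test split of human-validated problems.
    verified = load_dataset("SWE-bench/SWE-bench_Verified", split="test")
    print(len(verified))  # expected: 500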

Maintenance & Community

  • Active development with recent updates including multimodal support and containerized evaluation.
  • Contact: Carlos E. Jimenez (carlosej@princeton.edu), John Yang (johnby@stanford.edu).
  • Contributions and pull requests are welcomed.

Licensing & Compatibility

  • MIT License. Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Evaluation can be resource-intensive.
  • Support for creating new SWE-bench instances is temporarily paused.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 6
  • Issues (30d): 20

Star History

  • 396 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

SWE-Gym by SWE-Gym

Environment for training software engineering agents

Top 1.0% · 513 stars · created 9 months ago · updated 4 days ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

open-r1 by huggingface

Fully open reproduction of DeepSeek-R1

Top 0.2% · 25k stars · created 6 months ago · updated 3 days ago