SWE-bench by SWE-bench

Benchmark for evaluating LLMs on real-world GitHub issues

Created 1 year ago
3,538 stars

Top 13.7% on SourcePulse

View on GitHub
Project Summary

SWE-bench is a benchmark dataset and evaluation harness designed to assess the ability of large language models (LLMs) to resolve real-world software engineering issues sourced from GitHub. It targets AI researchers and developers building LLM-based code generation and debugging tools, offering a standardized method to measure model performance on complex, practical tasks.

How It Works

Each benchmark instance pairs a GitHub repository with a real issue filed against it. The model under evaluation must generate a code patch that resolves the described issue. The evaluation harness applies each candidate patch inside a Docker container built from the original codebase and runs the repository's tests, providing reproducible environments and a consistent, reliable assessment of model capabilities.
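
To make the task format concrete, here is a minimal sketch of loading the task instances and assembling a predictions file in the shape the harness expects. The Hugging Face identifier (princeton-nlp/SWE-bench) and the instance_id / model_name_or_path / model_patch keys are assumptions drawn from the public dataset and docs; verify them before relying on this.

```python
import json
from datasets import load_dataset  # pip install datasets

# Load the full benchmark; the Hub identifier and field names are assumptions
# based on the publicly listed dataset -- confirm them in the official docs.
dataset = load_dataset("princeton-nlp/SWE-bench", split="test")

predictions = []
for instance in dataset:
    # A real system would prompt an LLM with instance["problem_statement"]
    # plus repository context; an empty patch is used here as a placeholder.
    predictions.append({
        "instance_id": instance["instance_id"],
        "model_name_or_path": "my-model",  # hypothetical label for the model being evaluated
        "model_patch": "",                 # unified diff produced by the model
    })

# Write predictions as JSONL (one JSON object per line) for the harness.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```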

Quick Start & Requirements

  • Install: clone the repository and run pip install -e . from the repository root (a minimal evaluation run is sketched after this list).
  • Prerequisites: Docker is required for reproducible evaluations. Recommended: x86_64 architecture, 120 GB free storage, 16 GB RAM, and 8 CPU cores. arm64 support is experimental.
  • Setup: follow the Docker setup guide.
  • Docs: Read the Docs
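
With the package and Docker in place, a local evaluation run might look like the sketch below. The module path and flag names (swebench.harness.run_evaluation, --dataset_name, --predictions_path, --max_workers, --run_id) follow the documented CLI but should be confirmed with --help, as the interface may change between releases.

```python
import subprocess

# Invoke the containerized evaluation harness on a predictions file.
# Module path and flags mirror the project's documented CLI; check
# `python -m swebench.harness.run_evaluation --help` for the current interface.
subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",  # smaller variant for a quick smoke test
        "--predictions_path", "predictions.jsonl",         # patches in the prediction format shown earlier
        "--max_workers", "8",                              # each worker builds and runs a Docker container
        "--run_id", "smoke-test",                          # arbitrary label for this run (hypothetical)
    ],
    check=True,  # raise CalledProcessError if the harness exits non-zero
)
```

Keep --max_workers at or below the number of available CPU cores, since each worker builds and runs its own Docker container.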

Highlighted Details

  • Includes SWE-bench Multimodal for evaluating generalization to visual software domains.
  • SWE-bench Verified offers a subset of 500 problems confirmed solvable by human engineers.
  • Supports cloud-based evaluations via sb-cli (AWS) and Modal.
  • Provides pre-processed datasets for training custom models and running inference; the variants above can be loaded as shown in the sketch after this list.
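
The variants are distributed through the Hugging Face Hub. The identifiers in the sketch below are assumptions based on the publicly listed datasets; confirm them against the project documentation.

```python
from datasets import load_dataset

# Hub identifiers for the main variants (assumed names; verify in the docs).
VARIANTS = {
    "full": "princeton-nlp/SWE-bench",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",  # 500 human-validated problems
}

for name, repo_id in VARIANTS.items():
    ds = load_dataset(repo_id, split="test")
    print(f"{name:>8}: {len(ds)} instances from {repo_id}")

# SWE-bench Multimodal (visual software domains) is distributed separately,
# e.g. as princeton-nlp/SWE-bench_Multimodal, with its own dev/test splits.
```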

Maintenance & Community

  • Active development with recent updates including multimodal support and containerized evaluation.
  • Contact: Carlos E. Jimenez (carlosej@princeton.edu), John Yang (johnby@stanford.edu).
  • Contributions and pull requests are welcomed.

Licensing & Compatibility

  • MIT License. Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Evaluation can be resource-intensive.
  • Support for creating new SWE-bench instances is temporarily paused.

Health Check

  • Last Commit: 19 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 10
  • Issues (30d): 10

Star History

  • 195 stars in the last 30 days

Explore Similar Projects

Starred by Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research) and Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA).

DS-1000 by xlang-ai

0.4%
256
Benchmark for data science code generation
Created 2 years ago
Updated 10 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2%
1k
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Binyuan Hui (Research Scientist at Alibaba Qwen), and 2 more.

evalplus by evalplus

0.3%
2k
LLM code evaluation framework for rigorous testing
Created 2 years ago
Updated 1 month ago