SWE-bench by SWE-bench

Benchmark for evaluating LLMs on real-world GitHub issues

Created 1 year ago
3,538 stars

Top 13.7% on SourcePulse

View on GitHub
Project Summary

SWE-bench is a benchmark dataset and evaluation harness designed to assess the ability of large language models (LLMs) to resolve real-world software engineering issues sourced from GitHub. It targets AI researchers and developers building LLM-based code generation and debugging tools, offering a standardized method to measure model performance on complex, practical tasks.

How It Works

Each benchmark instance pairs a GitHub repository with a real issue filed against it. The model under evaluation must generate a code patch that resolves the described issue. The evaluation harness applies each candidate patch inside a Docker container built from the original codebase and runs the repository's tests, providing reproducible environments and a consistent, reliable assessment of model capabilities.
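
To make the task format concrete, here is a minimal sketch of loading the task instances and assembling a predictions file in the shape the harness expects. The Hugging Face identifier (princeton-nlp/SWE-bench) and the instance_id / model_name_or_path / model_patch keys are assumptions drawn from the public dataset and docs; verify them before relying on this.

```python
import json
from datasets import load_dataset  # pip install datasets

# Load the full benchmark; the Hub identifier and field names are assumptions
# based on the publicly listed dataset -- confirm them in the official docs.
dataset = load_dataset("princeton-nlp/SWE-bench", split="test")

predictions = []
for instance in dataset:
    # A real system would prompt an LLM with instance["problem_statement"]
    # plus repository context; an empty patch is used here as a placeholder.
    predictions.append({
        "instance_id": instance["instance_id"],
        "model_name_or_path": "my-model",  # hypothetical label for the model being evaluated
        "model_patch": "",                 # unified diff produced by the model
    })

# Write predictions as JSONL (one JSON object per line) for the harness.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```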

Quick Start & Requirements

  • Install: clone the repository and run pip install -e . from the repository root (a minimal evaluation run is sketched after this list).
  • Prerequisites: Docker is required for reproducible evaluations. Recommended: x86_64 architecture, 120 GB free storage, 16 GB RAM, and 8 CPU cores. arm64 support is experimental.
  • Setup: follow the Docker setup guide.
  • Docs: Read the Docs
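
With the package and Docker in place, a local evaluation run might look like the sketch below. The module path and flag names (swebench.harness.run_evaluation, --dataset_name, --predictions_path, --max_workers, --run_id) follow the documented CLI but should be confirmed with --help, as the interface may change between releases.

```python
import subprocess

# Invoke the containerized evaluation harness on a predictions file.
# Module path and flags mirror the project's documented CLI; check
# `python -m swebench.harness.run_evaluation --help` for the current interface.
subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",  # smaller variant for a quick smoke test
        "--predictions_path", "predictions.jsonl",         # patches in the prediction format shown earlier
        "--max_workers", "8",                              # each worker builds and runs a Docker container
        "--run_id", "smoke-test",                          # arbitrary label for this run (hypothetical)
    ],
    check=True,  # raise CalledProcessError if the harness exits non-zero
)
```

Keep --max_workers at or below the number of available CPU cores, since each worker builds and runs its own Docker container.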

Highlighted Details

  • Includes SWE-bench Multimodal for evaluating generalization to visual software domains.
  • SWE-bench Verified offers a subset of 500 problems confirmed solvable by human engineers.
  • Supports cloud-based evaluations via sb-cli (AWS) and Modal.
  • Provides pre-processed datasets for training custom models and running inference; the variants above can be loaded as shown in the sketch after this list.
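
The variants are distributed through the Hugging Face Hub. The identifiers in the sketch below are assumptions based on the publicly listed datasets; confirm them against the project documentation.

```python
from datasets import load_dataset

# Hub identifiers for the main variants (assumed names; verify in the docs).
VARIANTS = {
    "full": "princeton-nlp/SWE-bench",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",  # 500 human-validated problems
}

for name, repo_id in VARIANTS.items():
    ds = load_dataset(repo_id, split="test")
    print(f"{name:>8}: {len(ds)} instances from {repo_id}")

# SWE-bench Multimodal (visual software domains) is distributed separately,
# e.g. as princeton-nlp/SWE-bench_Multimodal, with its own dev/test splits.
```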

Maintenance & Community

  • Active development with recent updates including multimodal support and containerized evaluation.
  • Contact: Carlos E. Jimenez (carlosej@princeton.edu), John Yang (johnby@stanford.edu).
  • Contributions and pull requests are welcomed.

Licensing & Compatibility

  • MIT License. Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Evaluation can be resource-intensive.
  • Support for creating new SWE-bench instances is temporarily paused.

Health Check

  • Last Commit: 19 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 10
  • Issues (30d): 10

Star History

  • 195 stars in the last 30 days

Explore Similar Projects

Starred by Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research) and Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA).

DS-1000 by xlang-ai

0.4%
256
Benchmark for data science code generation
Created 2 years ago
Updated 10 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

0.2%
1k
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Binyuan Hui (Research Scientist at Alibaba Qwen), and 2 more.

evalplus by evalplus

0.3%
2k
LLM code evaluation framework for rigorous testing
Created 2 years ago
Updated 1 month ago