SWE-bench_Pro-os by scaleapi

AI agents for long-horizon software engineering tasks

Created 5 months ago
271 stars

Top 95.2% on SourcePulse

Project Summary

SWE-Bench Pro is a comprehensive benchmark designed to evaluate the capabilities of AI agents on long-horizon software engineering tasks. It challenges language models to generate code patches that resolve issues within given codebases. The project is aimed at AI researchers and developers building advanced software engineering agents, offering a standardized dataset and evaluation framework to measure and compare agent performance on complex, realistic problems.

How It Works

The benchmark provides a dataset of software engineering tasks, each consisting of a specific codebase and a detailed issue description. AI agents are tasked with producing a patch file that addresses the reported problem. The evaluation process leverages Docker for creating reproducible execution environments, ensuring that patch application and testing are consistent. For scalable evaluation across a large dataset, the framework integrates with Modal.
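The container-based evaluation step can be sketched as command construction. Note the image tag scheme and the in-container script below are illustrative assumptions; the README names the jefzda/sweap-images Docker Hub repository but not the exact tag layout or test entrypoint, and the instance ID shown is a hypothetical placeholder.

```python
import shlex

def build_eval_command(instance_id: str, patch_path: str) -> str:
    """Build a `docker run` invocation that applies a patch and runs tests
    inside a prebuilt task image. Tag scheme and in-container commands are
    assumptions for illustration, not the benchmark's confirmed layout."""
    image = f"jefzda/sweap-images:{instance_id}"  # hypothetical tag scheme
    inner = "git apply /patch.diff && ./run_tests.sh"  # hypothetical entrypoint
    return (
        f"docker run --rm -v {shlex.quote(patch_path)}:/patch.diff:ro "
        f"{image} bash -lc {shlex.quote(inner)}"
    )

# Example: one evaluation command for a hypothetical task instance.
cmd = build_eval_command("example__repo-123", "/tmp/pred.diff")
print(cmd)
```

Keeping the patch mounted read-only and running a fresh `--rm` container per task is what makes each evaluation reproducible and isolated.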

Quick Start & Requirements

  • Installation: Load the dataset via HuggingFace: datasets.load_dataset('ScaleAI/SWE-bench_Pro', split='test').
  • Prerequisites: Docker is essential for reproducible evaluations. Modal is required for scaling evaluations; install with pip install modal and configure via modal setup. Prebuilt Docker images for SWEAP are available on Docker Hub (jefzda/sweap-images).
  • Links: HuggingFace Dataset: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro, Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public.
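The loading step above, together with serializing an agent's output, can be sketched as follows. The prediction field names here follow the SWE-bench convention (`instance_id`, `model_name_or_path`, `model_patch`); they are assumed, not confirmed, for SWE-bench Pro, so check the dataset schema before submitting.

```python
import json

def load_tasks():
    # Requires `pip install datasets` and network access.
    from datasets import load_dataset
    return load_dataset("ScaleAI/SWE-bench_Pro", split="test")

def prediction_record(instance_id: str, patch: str, model: str) -> str:
    """Serialize one agent prediction as a JSON line. Field names follow
    the SWE-bench convention; assumed here for SWE-bench Pro."""
    return json.dumps({
        "instance_id": instance_id,
        "model_name_or_path": model,
        "model_patch": patch,
    })

if __name__ == "__main__":
    # Hypothetical instance ID; real IDs come from the dataset rows.
    line = prediction_record("example__repo-123", "diff --git a/f b/f\n", "my-agent")
    print(line)
```

One JSON line per task instance is the usual input format for SWE-bench-style evaluation harnesses.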

Highlighted Details

  • Focuses on "long-horizon" software engineering tasks, pushing beyond simple code generation to complex problem-solving.
  • Provides a standardized benchmark and evaluation framework for AI agents.
  • Employs Docker for reproducible execution environments and Modal for scalable evaluation infrastructure.
  • Offers pre-built Docker images (jefzda/sweap-images) to simplify agent environment setup.
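Modal provides the actual fan-out in the official harness; as a local stand-in, the same map-over-tasks pattern can be sketched with a stdlib thread pool. The `evaluate` function below is a placeholder, not the benchmark's runner.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(instance_id: str) -> dict:
    # Placeholder: a real runner would start the task's Docker container,
    # apply the model's patch, and run the test suite.
    resolved = instance_id.endswith("-ok")  # toy criterion for this sketch
    return {"instance_id": instance_id, "resolved": resolved}

def evaluate_all(instance_ids):
    # Fan tasks out across workers, mirroring how Modal parallelizes
    # per-task evaluation across cloud containers.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(evaluate, instance_ids))

results = evaluate_all(["task-1-ok", "task-2"])
print(results)
```

Because each task is independent (its own container and test run), the evaluation is embarrassingly parallel, which is why a serverless fan-out platform like Modal fits well.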

Maintenance & Community

Recent updates mention contributions from @miguelrc-scale and @18vijayb. The README does not provide direct links to community channels like Discord or Slack, nor does it outline a public roadmap.

Licensing & Compatibility

The specific open-source license for SWE-Bench Pro is not stated in the README. Users considering commercial applications or integration with closed-source projects should seek clarification before adopting it.

Limitations & Caveats

The README does not detail known limitations, bugs, or the project's development stage (e.g., alpha/beta). The dependency on Modal for scaled evaluations may represent a setup or cost consideration for some users. The setup process involves multiple configuration steps for Docker and Modal.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 23 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App), Gabriel Almeida (Cofounder of Langflow), and 9 more.

terminal-bench by laude-institute
1.4% · 2k stars
Benchmark for LLM agents in real terminal environments
Created 1 year ago · Updated 1 month ago