SWE-bench_Pro-os by scaleapi

AI agents for long-horizon software engineering tasks

Created 5 months ago
271 stars

Top 95.2% on SourcePulse

Project Summary

SWE-Bench Pro is a comprehensive benchmark designed to evaluate the capabilities of AI agents on long-horizon software engineering tasks. It challenges language models to generate code patches that resolve issues within given codebases. The project is aimed at AI researchers and developers building advanced software engineering agents, offering a standardized dataset and evaluation framework to measure and compare agent performance on complex, realistic problems.

How It Works

The benchmark provides a dataset of software engineering tasks, each consisting of a specific codebase and a detailed issue description. AI agents are tasked with producing a patch file that addresses the reported problem. The evaluation process leverages Docker for creating reproducible execution environments, ensuring that patch application and testing are consistent. For scalable evaluation across a large dataset, the framework integrates with Modal.
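The container-based evaluation step can be sketched as command construction. Note the image tag scheme and the in-container script below are illustrative assumptions; the README names the jefzda/sweap-images Docker Hub repository but not the exact tag layout or test entrypoint, and the instance ID shown is a hypothetical placeholder.

```python
import shlex

def build_eval_command(instance_id: str, patch_path: str) -> str:
    """Build a `docker run` invocation that applies a patch and runs tests
    inside a prebuilt task image. Tag scheme and in-container commands are
    assumptions for illustration, not the benchmark's confirmed layout."""
    image = f"jefzda/sweap-images:{instance_id}"  # hypothetical tag scheme
    inner = "git apply /patch.diff && ./run_tests.sh"  # hypothetical entrypoint
    return (
        f"docker run --rm -v {shlex.quote(patch_path)}:/patch.diff:ro "
        f"{image} bash -lc {shlex.quote(inner)}"
    )

# Example: one evaluation command for a hypothetical task instance.
cmd = build_eval_command("example__repo-123", "/tmp/pred.diff")
print(cmd)
```

Keeping the patch mounted read-only and running a fresh `--rm` container per task is what makes each evaluation reproducible and isolated.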

Quick Start & Requirements

  • Installation: Load the dataset via HuggingFace: datasets.load_dataset('ScaleAI/SWE-bench_Pro', split='test').
  • Prerequisites: Docker is essential for reproducible evaluations. Modal is required for scaling evaluations; install with pip install modal and configure via modal setup. Prebuilt Docker images for SWEAP are available on Docker Hub (jefzda/sweap-images).
  • Links: HuggingFace Dataset: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro, Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public.
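The loading step above, together with serializing an agent's output, can be sketched as follows. The prediction field names here follow the SWE-bench convention (`instance_id`, `model_name_or_path`, `model_patch`); they are assumed, not confirmed, for SWE-bench Pro, so check the dataset schema before submitting.

```python
import json

def load_tasks():
    # Requires `pip install datasets` and network access.
    from datasets import load_dataset
    return load_dataset("ScaleAI/SWE-bench_Pro", split="test")

def prediction_record(instance_id: str, patch: str, model: str) -> str:
    """Serialize one agent prediction as a JSON line. Field names follow
    the SWE-bench convention; assumed here for SWE-bench Pro."""
    return json.dumps({
        "instance_id": instance_id,
        "model_name_or_path": model,
        "model_patch": patch,
    })

if __name__ == "__main__":
    # Hypothetical instance ID; real IDs come from the dataset rows.
    line = prediction_record("example__repo-123", "diff --git a/f b/f\n", "my-agent")
    print(line)
```

One JSON line per task instance is the usual input format for SWE-bench-style evaluation harnesses.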

Highlighted Details

  • Focuses on "long-horizon" software engineering tasks, pushing beyond simple code generation to complex problem-solving.
  • Provides a standardized benchmark and evaluation framework for AI agents.
  • Employs Docker for reproducible execution environments and Modal for scalable evaluation infrastructure.
  • Offers pre-built Docker images (jefzda/sweap-images) to simplify agent environment setup.
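Modal provides the actual fan-out in the official harness; as a local stand-in, the same map-over-tasks pattern can be sketched with a stdlib thread pool. The `evaluate` function below is a placeholder, not the benchmark's runner.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(instance_id: str) -> dict:
    # Placeholder: a real runner would start the task's Docker container,
    # apply the model's patch, and run the test suite.
    resolved = instance_id.endswith("-ok")  # toy criterion for this sketch
    return {"instance_id": instance_id, "resolved": resolved}

def evaluate_all(instance_ids):
    # Fan tasks out across workers, mirroring how Modal parallelizes
    # per-task evaluation across cloud containers.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(evaluate, instance_ids))

results = evaluate_all(["task-1-ok", "task-2"])
print(results)
```

Because each task is independent (its own container and test run), the evaluation is embarrassingly parallel, which is why a serverless fan-out platform like Modal fits well.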

Maintenance & Community

Recent updates mention contributions from @miguelrc-scale and @18vijayb. The README does not provide direct links to community channels like Discord or Slack, nor does it outline a public roadmap.

Licensing & Compatibility

The specific open-source license for SWE-Bench Pro is not stated in the README. Users considering commercial applications or integration with closed-source projects should seek clarification before adopting it.

Limitations & Caveats

The README does not detail known limitations, bugs, or the project's development stage (e.g., alpha/beta). The dependency on Modal for scaled evaluations may represent a setup or cost consideration for some users. The setup process involves multiple configuration steps for Docker and Modal.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 23 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App), Gabriel Almeida (Cofounder of Langflow), and 9 more.

terminal-bench by laude-institute
1.4% · 2k stars
Benchmark for LLM agents in real terminal environments
Created 1 year ago · Updated 1 month ago