scaleapi
AI agents for long-horizon software engineering tasks
Top 95.2% on SourcePulse
Summary
SWE-Bench Pro is a benchmark for evaluating AI agents on long-horizon software engineering tasks. It challenges language models to generate code patches that resolve issues within given codebases. The project is aimed at AI researchers and developers building software engineering agents, and offers a standardized dataset and evaluation framework for measuring and comparing agent performance on complex, realistic problems.
How It Works
The benchmark provides a dataset of software engineering tasks, each consisting of a specific codebase and a detailed issue description. AI agents are tasked with producing a patch file that addresses the reported problem. The evaluation process uses Docker to create reproducible execution environments, so that patch application and testing are consistent. To scale evaluation across a large dataset, the framework integrates with Modal.
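The patch-then-test flow described above can be sketched in Python. This is an illustration under stated assumptions, not the project's actual harness: the function name `build_eval_command`, the in-container paths `/app` and `/tmp/agent.patch`, and the test command are all hypothetical, and the real evaluation scripts may apply patches differently. The sketch only builds the `docker run` argument list, so it can be inspected without Docker installed.

```python
import shlex
from pathlib import Path
from typing import List


def build_eval_command(
    image: str,
    patch_file: str,
    test_command: str,
    repo_dir: str = "/app",  # assumed location of the codebase inside the image
) -> List[str]:
    """Build a `docker run` argv that applies a patch and runs tests
    in a fresh, throwaway container."""
    patch_host = str(Path(patch_file).resolve())
    patch_in_container = "/tmp/agent.patch"  # hypothetical mount point
    # Apply the patch first; the tests run only if `git apply` succeeds.
    script = (
        f"cd {shlex.quote(repo_dir)} && "
        f"git apply {patch_in_container} && "
        f"{test_command}"
    )
    return [
        "docker", "run", "--rm",                         # remove container on exit
        "-v", f"{patch_host}:{patch_in_container}:ro",   # mount patch read-only
        image,
        "bash", "-c", script,
    ]
```

Passing the returned list to `subprocess.run` would execute it; under these assumptions, a zero exit code means the patch applied cleanly and the test command passed. Because each run starts from the same image, repeated evaluations of the same patch see the same environment.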
Quick Start & Requirements
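The dataset-loading step listed in this section can be sketched as follows. The `load_dataset` call and dataset name come from the README; the helper names `load_tasks` and `field_names` are hypothetical, and the sketch deliberately makes no assumption about which columns the dataset contains, instead reading them from the first record.

```python
from typing import Any, Dict, Iterable, List


def load_tasks(split: str = "test"):
    """Load the benchmark split named in the quick start.

    Requires `pip install datasets` and network access to the Hugging Face Hub.
    """
    from datasets import load_dataset  # imported lazily so the helper below works offline

    return load_dataset("ScaleAI/SWE-bench_Pro", split=split)


def field_names(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the sorted column names of the first task, or [] if there are none."""
    for row in rows:
        return sorted(row.keys())
    return []
```

Calling `field_names(load_tasks())` is a quick way to see which fields each task exposes before writing an agent against them.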
Load the dataset with datasets.load_dataset('ScaleAI/SWE-bench_Pro', split='test').
For scaled evaluation, pip install modal and configure via modal setup. Prebuilt Docker images for SWEAP are available on Docker Hub (jefzda/sweap-images).
Dataset: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro, Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public.
Highlighted Details
Prebuilt Docker images (jefzda/sweap-images) to simplify agent environment setup.
Maintenance & Community
Recent updates mention contributions from @miguelrc-scale and @18vijayb. The README does not provide direct links to community channels like Discord or Slack, nor does it outline a public roadmap.
Licensing & Compatibility
The specific open-source license for SWE-Bench Pro is not explicitly stated in the provided README text. This omission requires clarification for users considering commercial applications or integration with closed-source projects.
Limitations & Caveats
The README does not detail known limitations, bugs, or the project's development stage (e.g., alpha/beta). The dependency on Modal for scaled evaluations may represent a setup or cost consideration for some users. The setup process involves multiple configuration steps for Docker and Modal.