deep-swe by datacurve-ai

Benchmark for evaluating AI software engineering agents on complex tasks

Created 2 months ago

1,191 stars

Top 32.0% on SourcePulse

1 Expert Loves This Project

Edward Sun

Research Scientist at Meta Superintelligence Lab

Project Summary

DeepSWE is a benchmark designed to rigorously measure the capabilities of frontier coding agents on complex, long-horizon software engineering tasks. It targets researchers and engineers evaluating AI agents, offering a standardized framework with isolated environments and objective, program-based verifiers to assess agent performance on real-world coding challenges.

How It Works

Tasks follow the Harbor format. Each task consists of metadata (task.toml), an agent prompt (instruction.md), a reproducible environment (environment/), and an automated verifier (tests/test.sh). Solutions are judged on observable behavior, not on an exact match against reference code. The benchmark runs agents in a sandbox using Pier, a Harbor-compatible framework. Pier manages network access per agent and logs detailed trajectories, which helps make evaluation runs reproducible.
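As a rough illustration of the task layout described above, the following sketch checks that a task directory contains the four entries named in this summary. The helper name missing_task_entries is hypothetical and is not part of Harbor or Pier, and the Harbor format may require or allow entries that are not listed here.

```python
from pathlib import Path

# Entries named in the summary above. This list is an assumption drawn from
# that text only; the actual Harbor format may define more.
EXPECTED_FILES = ("task.toml", "instruction.md", "tests/test.sh")
EXPECTED_DIRS = ("environment",)


def missing_task_entries(task_dir):
    """Return the expected entries that are absent from task_dir.

    Hypothetical helper for illustration; directories are reported with a
    trailing slash so they are easy to tell apart from files.
    """
    root = Path(task_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means all four entries are present. It says nothing about whether their contents are valid.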

Quick Start & Requirements

Clone the repository, then install the Pier framework with uv tool install datacurve-pier. Before running tasks, set the API key for the model provider you use (e.g., ANTHROPIC_API_KEY, OPENAI_API_KEY). The README's example commands run tasks through Pier with an agent such as mini-swe-agent against a chosen model: pier run -p deep-swe/tasks --agent mini-swe-agent --model <model>. Tasks span multiple languages, including TypeScript, Go, Python, JavaScript, and Rust.
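For scripted runs, the command quoted above could be assembled programmatically. The sketch below is a minimal example that uses only the subcommand and flags shown in this summary (run, -p, --agent, --model). The function names build_pier_command and missing_api_keys are hypothetical, and the model identifier is left for the caller to supply.

```python
import os


def build_pier_command(model, agent="mini-swe-agent", tasks_path="deep-swe/tasks"):
    """Assemble the argv list for the `pier run` invocation quoted above.

    Hypothetical helper: only the flags that appear in the summary are used.
    """
    return ["pier", "run", "-p", tasks_path, "--agent", agent, "--model", model]


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

The resulting list can be passed to subprocess.run once missing_api_keys reports nothing missing for the provider in use. Which key a given model needs depends on its provider and is not specified here.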

Highlighted Details

Features 113 original, long-horizon software engineering tasks derived from active open-source projects.
Employs isolated execution environments and program-based verifiers for objective correctness assessment.
Supports multiple programming languages: TypeScript, Go, Python, JavaScript, and Rust.
Leverages the Pier framework for enhanced sandboxing, network control, and detailed evaluation trajectory logging.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap were found in the provided README excerpt.

Licensing & Compatibility

The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README excerpt.

Limitations & Caveats

The benchmark requires specific API keys for LLM access and familiarity with the Pier evaluation framework. The README excerpt does not detail any known bugs, alpha status, or unsupported platforms.

Health Check

Last Commit

5 days ago

Responsiveness

Inactive

Star History

243 stars in the last 30 days