deep-swe  by datacurve-ai

Benchmark for evaluating AI software engineering agents on complex tasks

Created 4 weeks ago

764 stars

Top 45.2% on SourcePulse

Summary

DeepSWE is a benchmark designed to rigorously measure the capabilities of frontier coding agents on complex, long-horizon software engineering tasks. It targets researchers and engineers evaluating AI agents, offering a standardized framework with isolated environments and objective, program-based verifiers to assess agent performance on real-world coding challenges.

How It Works

Tasks adhere to the Harbor format, comprising metadata (task.toml), agent prompts (instruction.md), reproducible environments (environment/), and automated verifiers (tests/test.sh). Solutions are evaluated based on observable behavior, not exact code matching. The benchmark utilizes Pier, a Harbor-compatible framework, for sandboxed agent execution. Pier enhances isolation by managing network access per agent and provides detailed trajectory logging, enabling robust and reproducible evaluations.
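As a rough illustration of the task layout described above, the following Python sketch checks whether a task directory contains the four named entries. The helper name missing_entries is hypothetical, and treating environment/ as a directory and the other three as regular files is an assumption inferred from the names; the Harbor format itself may allow or require more than this.

```python
from pathlib import Path

# Entries named in the task description. Treating "environment" as a directory
# and the rest as regular files is an assumption based on their names.
REQUIRED_FILES = ("task.toml", "instruction.md", "tests/test.sh")
REQUIRED_DIRS = ("environment",)


def missing_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from task_dir (hypothetical helper)."""
    missing = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (task_dir / name).is_dir()]
    return missing
```

Calling missing_entries(Path("deep-swe/tasks/<task>")) on a cloned task directory would return an empty list when all four entries are present, or the names of whatever is missing.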

Quick Start & Requirements

Clone the repository and install the Pier framework:

    uv tool install datacurve-pier

Running tasks requires API keys for the models being evaluated (e.g., ANTHROPIC_API_KEY, OPENAI_API_KEY). Example commands run tasks through Pier with agents such as mini-swe-agent against a specific model:

    pier run -p deep-swe/tasks --agent mini-swe-agent --model <model>

Tasks span multiple languages, including TypeScript, Go, Python, JavaScript, and Rust.
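For scripted runs, the invocation above can be assembled programmatically. This is a minimal Python sketch, not part of the project: build_pier_command and missing_api_keys are hypothetical helpers, the flags are copied from the example command, and which API key a given model needs is left to the caller because the summary does not specify it.

```python
import os
import shlex


def build_pier_command(model: str,
                       agent: str = "mini-swe-agent",
                       tasks_path: str = "deep-swe/tasks") -> list[str]:
    """Assemble the argv list for the example `pier run` command (hypothetical helper)."""
    return ["pier", "run", "-p", tasks_path, "--agent", agent, "--model", model]


def missing_api_keys(env, required=("ANTHROPIC_API_KEY",)) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`.

    Which keys a given model needs is an assumption the caller must supply.
    """
    return [name for name in required if not env.get(name)]


def describe_run(model: str) -> str:
    """Return a shell-quoted command string for display or logging."""
    return shlex.join(build_pier_command(model))
```

A caller might check missing_api_keys(os.environ) before launching, then pass the list from build_pier_command to subprocess.run once the model name is known.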

Highlighted Details

  • Features 113 original, long-horizon software engineering tasks derived from active open-source projects.
  • Employs isolated execution environments and program-based verifiers for objective correctness assessment.
  • Supports multiple programming languages: TypeScript, Go, Python, JavaScript, and Rust.
  • Leverages the Pier framework for enhanced sandboxing, network control, and detailed evaluation trajectory logging.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap were found in the provided README excerpt.

Licensing & Compatibility

The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README excerpt.

Limitations & Caveats

The benchmark requires specific API keys for LLM access and familiarity with the Pier evaluation framework. The README excerpt does not detail any known bugs, alpha status, or unsupported platforms.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 4
  • Issues (30d): 30
  • Star history: 769 stars in the last 28 days
