datacurve-ai: Benchmark for evaluating AI software engineering agents on complex tasks
Summary
DeepSWE is a benchmark designed to rigorously measure the capabilities of frontier coding agents on complex, long-horizon software engineering tasks. It targets researchers and engineers evaluating AI agents, offering a standardized framework with isolated environments and objective, program-based verifiers to assess agent performance on real-world coding challenges.
How It Works
Each task follows the Harbor format: metadata (task.toml), an agent prompt (instruction.md), a reproducible environment (environment/), and an automated verifier (tests/test.sh). Solutions are judged on observable behavior, not on an exact match against reference code. The benchmark runs agents in a sandbox through Pier, a Harbor-compatible framework. Pier strengthens isolation by managing network access per agent, and it logs detailed trajectories so that evaluations can be reproduced.
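The task layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of Harbor or Pier: the function name missing_task_entries is hypothetical, and it assumes each task lives in its own directory containing exactly the four entries named in this summary.

```python
from pathlib import Path

# Entries this summary lists for a Harbor-format task.
# A trailing "/" marks an entry that should be a directory.
EXPECTED_ENTRIES = ("task.toml", "instruction.md", "environment/", "tests/test.sh")


def missing_task_entries(task_dir):
    """Return the expected entries that are absent from task_dir (hypothetical helper)."""
    root = Path(task_dir)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry
        # Directories and regular files are checked differently.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

An empty list means the directory has the four pieces the format calls for. The check says nothing about whether their contents are valid.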
Quick Start & Requirements
Clone the repository and install the Pier framework with uv tool install datacurve-pier. Before running tasks, set the API key for the model provider you plan to use (e.g., ANTHROPIC_API_KEY, OPENAI_API_KEY). To run the tasks with an agent such as mini-swe-agent against a specific model, use: pier run -p deep-swe/tasks --agent mini-swe-agent --model <model>. Tasks span multiple languages, including TypeScript, Go, Python, JavaScript, and Rust.
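As a pre-flight sketch of the steps above, the following Python checks which of the example API keys are unset and assembles the pier run command without executing it. The helper names unset_keys and pier_run_command are hypothetical, the snippet assumes the keys are read from environment variables, and it uses only the flags quoted in this summary. Which key is actually needed depends on the chosen model.

```python
import os
import shlex

# Key names taken from the summary's examples; other providers may use other names.
EXAMPLE_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def unset_keys(env=None, keys=EXAMPLE_KEYS):
    """Return the key names that are missing or empty in env (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in keys if not env.get(key)]


def pier_run_command(model, tasks_path="deep-swe/tasks", agent="mini-swe-agent"):
    """Build the argument list for the command quoted in the summary (hypothetical helper)."""
    return ["pier", "run", "-p", tasks_path, "--agent", agent, "--model", model]


if __name__ == "__main__":
    print("Unset example keys:", unset_keys())
    # "<model>" is the placeholder from the summary; substitute a real model name.
    print("Command:", shlex.join(pier_run_command("<model>")))
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting concerns.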
Maintenance & Community
No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap were found in the provided README excerpt.
Licensing & Compatibility
The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README excerpt.
Limitations & Caveats
Running the benchmark requires API keys for the models under evaluation and familiarity with the Pier evaluation framework. The README excerpt does not detail any known bugs, alpha status, or unsupported platforms.