Coding agent for SWE-bench evaluation
This project provides an open-source implementation of a coding agent for the SWE-bench benchmark, designed to evaluate AI systems on real-world software engineering tasks. It targets AI researchers and developers who want to benchmark and improve LLM-based coding agents, offering a framework for complex, multi-step problem-solving and code generation.
How It Works
The agent uses Anthropic's Claude 3.5 Sonnet as its core driver, augmented by OpenAI models for ensembling. Its architecture is adapted from the agent described in Anthropic's SWE-bench blog post and includes tools for bash command execution, file viewing and editing, and sequential problem-solving. A key feature is the majority-vote ensembler, which uses an LLM to select the best solution from multiple generated candidates, improving reliability.
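As a rough illustration of that selection step (a sketch only, not the repository's actual implementation; the client setup, prompt wording, and model name are assumptions), an LLM-as-judge ensembler might look like this:

```python
# Hypothetical majority-vote-style ensembler: an LLM is shown every candidate
# patch and asked to pick the one most likely to resolve the issue.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def pick_best_candidate(problem_statement: str, candidate_patches: list[str]) -> int:
    """Return the index of the candidate patch the judge model prefers."""
    numbered = "\n\n".join(
        f"--- Candidate {i} ---\n{patch}" for i, patch in enumerate(candidate_patches)
    )
    prompt = (
        "You are reviewing candidate patches for a GitHub issue.\n\n"
        f"Issue:\n{problem_statement}\n\n"
        f"{numbered}\n\n"
        "Reply with only the number of the candidate most likely to fix the issue."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the repo's choice may differ
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```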
Quick Start & Requirements
Run the setup script (./setup.sh) and activate the virtual environment (source .venv/bin/activate). You will need Docker plus Anthropic and OpenAI API keys. Launch the interactive CLI with python cli.py, or run the agent on a subset of SWE-bench problems:
python run_agent_on_swebench_problem.py --num-examples 5 --num-candidate-solutions 2
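Because the agent calls both the Anthropic and OpenAI APIs, it helps to verify credentials before a run. A minimal pre-flight check, assuming the project reads the official SDKs' default environment variables (ANTHROPIC_API_KEY and OPENAI_API_KEY):

```python
# Hypothetical pre-flight check: confirms the standard SDK environment
# variables are set before launching a (potentially costly) agent run.
import os
import sys

required = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("API keys found; ready to run the agent.")
```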
Highlighted Details
Maintenance & Community
The project is maintained by augmentcode. Contributions are welcome via issues or pull requests.
Licensing & Compatibility
Limitations & Caveats
The project depends on external LLM APIs (Anthropic, OpenAI), which incurs usage costs and is subject to rate limits. Docker performance may degrade when running more than 8 parallel processes.