augment-swebench-agent by augmentcode

Coding agent for SWE-bench evaluation

created 4 months ago
777 stars

Top 45.9% on sourcepulse

Project Summary

This project provides an open-source coding agent for the SWE-bench benchmark, which evaluates AI systems on real-world software engineering tasks. It targets AI researchers and developers seeking to benchmark and improve LLM-based coding agents, offering a robust framework for complex problem-solving and code generation.

How It Works

The agent uses Anthropic's Claude 3.7 Sonnet as its core driver, augmented by OpenAI's o1 for ensembling. Its architecture is adapted from the agent described in Anthropic's SWE-bench blog post, incorporating tools for bash command execution, file viewing/editing, and sequential problem-solving. A key feature is the majority vote ensembler, which uses an LLM to select the best solution from multiple generated candidates, improving reliability.
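
The candidate-selection step can be sketched as follows. Note this is a simplified stand-in: the project's real ensembler asks an LLM to judge the candidates, whereas `majority_vote` below picks the most frequent candidate patch by simple counting, and the function name and signature are illustrative assumptions rather than the repo's actual API.

```python
from collections import Counter

def majority_vote(candidates: list[str]) -> str:
    """Pick the candidate solution that appears most often.

    Illustrative frequency-based stand-in for the agent's LLM-judged
    majority vote ensembler. Ties resolve to the earliest-seen
    candidate (Counter preserves insertion order).
    """
    if not candidates:
        raise ValueError("no candidate solutions provided")
    # Normalize trivial whitespace differences before counting.
    normalized = [c.strip() for c in candidates]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner
```

With, say, three candidate patches where two agree, the agreed-upon patch wins; this captures the intuition that independently regenerated identical solutions are more likely correct.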

Quick Start & Requirements

  • Install: Clone the repository, run ./setup.sh, and activate the virtual environment (source .venv/bin/activate).
  • Prerequisites: Docker (v26.1.3 tested), Anthropic API key, OpenAI API key.
  • Usage:
    • Interactive mode: python cli.py
    • SWE-bench mode: python run_agent_on_swebench_problem.py --num-examples 5 --num-candidate-solutions 2
  • Resources: Running the full SWE-bench Verified dataset (500 examples) with 8 parallel processes and 10 shards can take several hours. High API rate limits are recommended.
  • Docs: SWE-bench Verified Agent

Highlighted Details

  • Achieved a 65.4% success rate on SWE-bench Verified using off-the-shelf models.
  • Implements tools for code execution, file manipulation, and sequential reasoning.
  • Includes a majority vote ensembler for selecting optimal code solutions.
  • Supports parallel execution across multiple machines via sharding.
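
Sharding splits the 500-example dataset across machines so each runs a disjoint subset. A minimal sketch of round-robin shard assignment follows; the `shard` helper and its signature are assumptions for illustration, not the repo's actual interface or CLI flags.

```python
def shard(examples: list, shard_index: int, num_shards: int) -> list:
    """Return the subset of examples assigned to one shard.

    Round-robin split: shard i takes every num_shards-th example
    starting at offset i, so 500 examples over 10 shards yields
    50 per shard with no overlap.
    """
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return examples[shard_index::num_shards]
```

Each machine would then be launched with its own shard index, and the per-shard results merged afterwards for scoring.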

Maintenance & Community

The project is maintained by augmentcode. Contributions are welcome via issues or pull requests.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The project relies on external LLM APIs (Anthropic, OpenAI), incurring costs and potential rate limiting. Docker performance may degrade beyond 8 parallel processes.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 193 stars in the last 90 days
