Coding agent for SWE-bench evaluation
This project provides an open-source implementation of a coding agent for the SWE-bench benchmark, designed to evaluate AI systems on real-world software engineering tasks. It targets AI researchers and developers who want to benchmark and improve LLM-based coding agents, offering a framework for complex, multi-step problem-solving and code generation.
How It Works
The agent uses Anthropic's Claude 3.5 Sonnet as its core driver, augmented by OpenAI models for ensembling. Its architecture is adapted from the agent described in Anthropic's SWE-bench blog post and includes tools for bash command execution, file viewing and editing, and sequential problem-solving. A key feature is the majority-vote ensembler, which uses an LLM to select the best solution from multiple generated candidates, improving reliability.
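As a rough illustration of that selection step (a sketch only, not the repository's actual implementation; the client setup, prompt wording, and model name are assumptions), an LLM-as-judge ensembler might look like this:

```python
# Hypothetical majority-vote-style ensembler: an LLM is shown every candidate
# patch and asked to pick the one most likely to resolve the issue.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def pick_best_candidate(problem_statement: str, candidate_patches: list[str]) -> int:
    """Return the index of the candidate patch the judge model prefers."""
    numbered = "\n\n".join(
        f"--- Candidate {i} ---\n{patch}" for i, patch in enumerate(candidate_patches)
    )
    prompt = (
        "You are reviewing candidate patches for a GitHub issue.\n\n"
        f"Issue:\n{problem_statement}\n\n"
        f"{numbered}\n\n"
        "Reply with only the number of the candidate most likely to fix the issue."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the repo's choice may differ
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```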
Quick Start & Requirements
Run the setup script (./setup.sh) and activate the virtual environment (source .venv/bin/activate). You will need Docker plus Anthropic and OpenAI API keys. Launch the interactive CLI with python cli.py, or run the agent on a subset of SWE-bench problems:
python run_agent_on_swebench_problem.py --num-examples 5 --num-candidate-solutions 2
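Because the agent calls both the Anthropic and OpenAI APIs, it helps to verify credentials before a run. A minimal pre-flight check, assuming the project reads the official SDKs' default environment variables (ANTHROPIC_API_KEY and OPENAI_API_KEY):

```python
# Hypothetical pre-flight check: confirms the standard SDK environment
# variables are set before launching a (potentially costly) agent run.
import os
import sys

required = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("API keys found; ready to run the agent.")
```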
Highlighted Details
Maintenance & Community
The project is maintained by augmentcode. Contributions are welcome via issues or pull requests.
Licensing & Compatibility
Limitations & Caveats
The project depends on external LLM APIs (Anthropic, OpenAI), which incurs usage costs and is subject to rate limits. Docker performance may degrade when running more than 8 parallel processes.