auto-harness by neosigmaai

Automated agent optimization framework

Created 1 week ago

400 stars

Top 72.2% on SourcePulse

Project Summary

This repository provides a framework for creating self-improving AI coding agents. It automates the process of agent refinement by enabling agents to learn from benchmark failures, iteratively enhance their system prompts and tools, and validate changes against a self-maintained evaluation suite. This approach demonstrably boosts agent performance, as shown by a ~40% score improvement on the Tau3 benchmark, making it valuable for researchers and developers seeking to enhance agent capabilities without constant manual intervention.

How It Works

The system operates on a continuous loop: run a benchmark, analyze failures, improve the agent's code (agent/agent.py), gate the changes, record results, and update learnings. The core innovation lies in the agent's ability to autonomously identify failure patterns, update its own logic, and maintain a regression test suite (workspace/suite.json). Changes are rigorously gated by passing both the self-maintained eval suite and achieving a higher score on the full test set compared to previous bests. Learnings are logged persistently in workspace/learnings.md to preserve context across iterations.
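
The loop described above can be sketched as follows. This is a conceptual illustration only: the function names, result format, and return convention are assumptions, while the two gating rules (the regression suite must pass, and the full-test score must beat the previous best) come from the summary itself.

```python
# Conceptual sketch of the run -> analyze -> improve -> gate -> record loop.
# Interfaces here are hypothetical; the real repo drives an external coding
# agent that edits agent/agent.py and maintains workspace/suite.json.

def improvement_iteration(run_benchmark, improve_agent, run_suite, best_score):
    """One loop iteration; returns (best_score, accepted)."""
    results = run_benchmark()                        # run full benchmark
    failures = [r for r in results if not r["passed"]]
    improve_agent(failures)                          # agent rewrites its own logic
    if not run_suite():                              # gate 1: eval suite must pass
        return best_score, False
    rerun = run_benchmark()                          # re-score after the change
    new_score = sum(r["passed"] for r in rerun) / len(rerun)
    if new_score <= best_score:                      # gate 2: must beat previous best
        return best_score, False
    return new_score, True                           # record new best, keep change
```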

Quick Start & Requirements

  • Install & run: the project is set up and executed via Docker. Key commands: docker compose build (build), docker compose run autoeval python prepare.py (initialization), and docker compose run autoeval python benchmark.py (running benchmarks).
  • Prerequisites: Docker, an OPENAI_API_KEY, and a compatible coding agent (e.g., Claude Code, Codex CLI). The TAU2_DATA_DIR environment variable must be set.
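
Collected as a minimal quick-start sequence (commands taken directly from the summary above; the placeholder values are yours to fill in):

```shell
# Assumes Docker is installed. Export the required environment variables first.
export OPENAI_API_KEY=...        # your OpenAI API key
export TAU2_DATA_DIR=...         # path to the benchmark data

docker compose build                              # build the containers
docker compose run autoeval python prepare.py     # one-time initialization
docker compose run autoeval python benchmark.py   # run the benchmark loop
```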

Highlighted Details

  • Achieved a significant performance jump on Tau3 benchmark tasks, improving agent scores from 0.56 to 0.78.
  • Features a self-maintaining regression eval suite (workspace/suite.json) that the agent updates dynamically.
  • Employs a robust two-step gating mechanism: eval suite pass rate and full test score improvement.
  • Maintains a persistent log (workspace/learnings.md) of agent actions, successes, and requests for human intervention.
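
As a rough illustration of the persistent log, an append-only writer for workspace/learnings.md might look like the sketch below. The timestamped-heading entry format is an assumption; the repo does not document the file's actual schema.

```python
from datetime import datetime, timezone

# Illustrative append-only logger for workspace/learnings.md. Appending (never
# rewriting) is what preserves context across iterations; the Markdown heading
# format below is a guess, not the project's documented schema.
def log_learning(note: str, path: str = "workspace/learnings.md") -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"\n## {stamp}\n\n{note}\n")
```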

Maintenance & Community

No specific details regarding maintainers, sponsorships, or community channels (e.g., Discord, Slack) are provided in the README.

Licensing & Compatibility

The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README content.

Limitations & Caveats

The agent's modifications are currently restricted to the agent/agent.py file. The system depends on an external coding agent and the OpenAI API. While the framework is benchmark-agnostic, the provided example uses tau-bench.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 0
  • Star History: 400 stars in the last 8 days
