auto-harness by neosigmaai

Automated agent optimization framework

Created 1 week ago

400 stars

Top 72.2% on SourcePulse

Project Summary

This repository provides a framework for creating self-improving AI coding agents. It automates the process of agent refinement by enabling agents to learn from benchmark failures, iteratively enhance their system prompts and tools, and validate changes against a self-maintained evaluation suite. This approach demonstrably boosts agent performance, as shown by a ~40% score improvement on the Tau3 benchmark, making it valuable for researchers and developers seeking to enhance agent capabilities without constant manual intervention.

How It Works

The system operates on a continuous loop: run a benchmark, analyze failures, improve the agent's code (agent/agent.py), gate the changes, record results, and update learnings. The core innovation lies in the agent's ability to autonomously identify failure patterns, update its own logic, and maintain a regression test suite (workspace/suite.json). Changes are rigorously gated by passing both the self-maintained eval suite and achieving a higher score on the full test set compared to previous bests. Learnings are logged persistently in workspace/learnings.md to preserve context across iterations.
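
The loop described above can be sketched as follows. This is a conceptual illustration only: the function names, result format, and return convention are assumptions, while the two gating rules (the regression suite must pass, and the full-test score must beat the previous best) come from the summary itself.

```python
# Conceptual sketch of the run -> analyze -> improve -> gate -> record loop.
# Interfaces here are hypothetical; the real repo drives an external coding
# agent that edits agent/agent.py and maintains workspace/suite.json.

def improvement_iteration(run_benchmark, improve_agent, run_suite, best_score):
    """One loop iteration; returns (best_score, accepted)."""
    results = run_benchmark()                        # run full benchmark
    failures = [r for r in results if not r["passed"]]
    improve_agent(failures)                          # agent rewrites its own logic
    if not run_suite():                              # gate 1: eval suite must pass
        return best_score, False
    rerun = run_benchmark()                          # re-score after the change
    new_score = sum(r["passed"] for r in rerun) / len(rerun)
    if new_score <= best_score:                      # gate 2: must beat previous best
        return best_score, False
    return new_score, True                           # record new best, keep change
```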

Quick Start & Requirements

  • Install & run: the project is set up and executed via Docker. Key commands: docker compose build (build), docker compose run autoeval python prepare.py (initialization), and docker compose run autoeval python benchmark.py (running benchmarks).
  • Prerequisites: Docker, an OPENAI_API_KEY, and a compatible coding agent (e.g., Claude Code, Codex CLI). The TAU2_DATA_DIR environment variable must be set.
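
Collected as a minimal quick-start sequence (commands taken directly from the summary above; the placeholder values are yours to fill in):

```shell
# Assumes Docker is installed. Export the required environment variables first.
export OPENAI_API_KEY=...        # your OpenAI API key
export TAU2_DATA_DIR=...         # path to the benchmark data

docker compose build                              # build the containers
docker compose run autoeval python prepare.py     # one-time initialization
docker compose run autoeval python benchmark.py   # run the benchmark loop
```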

Highlighted Details

  • Achieved a significant performance jump on Tau3 benchmark tasks, improving agent scores from 0.56 to 0.78.
  • Features a self-maintaining regression eval suite (workspace/suite.json) that the agent updates dynamically.
  • Employs a robust two-step gating mechanism: eval suite pass rate and full test score improvement.
  • Maintains a persistent log (workspace/learnings.md) of agent actions, successes, and requests for human intervention.
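
As a rough illustration of the persistent log, an append-only writer for workspace/learnings.md might look like the sketch below. The timestamped-heading entry format is an assumption; the repo does not document the file's actual schema.

```python
from datetime import datetime, timezone

# Illustrative append-only logger for workspace/learnings.md. Appending (never
# rewriting) is what preserves context across iterations; the Markdown heading
# format below is a guess, not the project's documented schema.
def log_learning(note: str, path: str = "workspace/learnings.md") -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"\n## {stamp}\n\n{note}\n")
```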

Maintenance & Community

No specific details regarding maintainers, sponsorships, or community channels (e.g., Discord, Slack) are provided in the README.

Licensing & Compatibility

The license type and any compatibility notes for commercial use or closed-source linking are not specified in the provided README content.

Limitations & Caveats

The agent's modifications are currently restricted to the agent/agent.py file. The system depends on an external coding agent and the OpenAI API. While the framework is benchmark-agnostic, the provided example uses tau-bench.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 0
  • Star History: 400 stars in the last 8 days
