self-improving-agents  by BetterForAll

Evolving AI agents via adversarial and self-improvement loops

Created 2 months ago
266 stars

Top 96.2% on SourcePulse

Project Summary

Self-Improving Agents -- A Progression is a structured framework for building AI code agents that improve themselves. Aimed at researchers and power users, it advances through four levels, ending in adversarial arenas, with the goal of producing agents that remain robust beyond the static benchmarks they were tuned on.

How It Works

The project defines four levels of self-improving code agents:
  • Level 1, AutoResearch: A basic loop in which an LLM iteratively refines code against a benchmark. Because the reward is verifiable, the loop can run autonomously.
  • Level 2, Feedback Loop: Adds a reviewer agent that gives structured explanations of why code failed, optimizing cost-performance through asymmetric information.
  • Level 3, HyperAgent Loop: Lets agents rewrite their own source code, guarded by a multi-stage validation process for safety and stability.
  • Level 4, Arena Loop: Uses adversarial co-evolution, pitting code agents against evolving test agents in an "arms race" that builds robustness against unforeseen edge cases. Level 4b adds tournament-style competition for strategy evolution.
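The refine-and-review idea behind Levels 1 and 2 can be sketched as follows. This is a minimal illustration, not the repository's implementation: every name here (improve, toy_benchmark, stub_coder, stub_reviewer) is hypothetical, and the LLM calls are replaced by deterministic stubs so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; none of these names come from the repository.
Benchmark = Callable[[str], Tuple[float, List[str]]]  # code -> (score in [0, 1], failures)
Coder = Callable[[str, Optional[str]], str]           # (code, feedback) -> new code
Reviewer = Callable[[str, List[str]], str]            # (code, failures) -> explanation


@dataclass
class LoopResult:
    best_code: str
    best_score: float
    iterations: int


def improve(seed_code: str, coder: Coder, benchmark: Benchmark,
            reviewer: Optional[Reviewer] = None,
            max_iters: int = 5, target: float = 1.0) -> LoopResult:
    """Keep the best-scoring candidate; stop at the target score or max_iters."""
    best_code = seed_code
    best_score, failures = benchmark(seed_code)
    iterations = 0
    while iterations < max_iters and best_score < target:
        iterations += 1
        # With no reviewer this is a plain refine loop (Level 1 style);
        # with one, the coder also receives an explanation (Level 2 style).
        feedback = reviewer(best_code, failures) if reviewer else None
        candidate = coder(best_code, feedback)
        score, candidate_failures = benchmark(candidate)
        if score > best_score:  # the benchmark score acts as the verifiable reward
            best_code, best_score, failures = candidate, score, candidate_failures
    return LoopResult(best_code, best_score, iterations)


def toy_benchmark(code: str) -> Tuple[float, List[str]]:
    """Score a candidate add() on three fixed cases (toy only: exec runs the code)."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        add = namespace["add"]
    except Exception as exc:
        return 0.0, [f"could not load add(): {exc!r}"]
    cases = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]
    failures = [f"add{args} returned {add(*args)}, expected {want}"
                for args, want in cases if add(*args) != want]
    return 1 - len(failures) / len(cases), failures


def stub_reviewer(code: str, failures: List[str]) -> str:
    """Stand-in for a reviewer LLM: turns raw failures into a hint."""
    return "Results look like a subtraction; use '+' instead of '-'."


def stub_coder(code: str, feedback: Optional[str]) -> str:
    """Stand-in for a coding LLM: applies the hint when one is given."""
    return code.replace("a - b", "a + b") if feedback else code


if __name__ == "__main__":
    seed = "def add(a, b):\n    return a - b\n"
    print(improve(seed, stub_coder, toy_benchmark, reviewer=stub_reviewer))
```

In a real run the two stubs would be model calls. The accept-only-if-better rule is one simple design choice among several; the repository may select candidates differently.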

Quick Start & Requirements

Install the dependencies with pip install -r requirements.txt. A GEMINI_API_KEY from Google AI Studio is required. Each level runs from its own run.py script (e.g., python autoresearch/run.py), and the --task argument selects a task such as 'snake', 'support', or 'email_validation'. To run the full set of experiments and compare results across levels, use run_all.py and analyze_results.py.
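As a rough sketch of how these pieces fit together, the helper below checks for the API key and assembles the command for a Level 1 run. Only autoresearch/run.py, --task, and GEMINI_API_KEY come from the summary; the function names are hypothetical, and the script paths for the other levels are not given here, so they are left out.

```python
import os
import subprocess
import sys
from typing import List, Mapping, Optional


def has_api_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if GEMINI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("GEMINI_API_KEY"))


def build_command(task: str, script: str = "autoresearch/run.py") -> List[str]:
    """Build the argv for one run, using the current Python interpreter."""
    return [sys.executable, script, "--task", task]


if __name__ == "__main__":
    if not has_api_key():
        sys.exit("Set GEMINI_API_KEY (from Google AI Studio) before running.")
    # Must be run from the repository root so the relative script path resolves.
    subprocess.run(build_command("snake"), check=True)
```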

Highlighted Details

  • Benchmark Brittleness: Fixed benchmarks can lead to brittle agent performance; Levels 1-3 demonstrated significant score degradation on adversarial tests compared to their original benchmarks.
  • Cost-Performance: The Feedback Loop (Level 2) presents a superior cost-performance tradeoff ($0.03 per task) relative to the Arena Loop ($0.63 per task).
  • Robustness Development: The Arena Loop (Level 4) effectively builds robustness, showing improved performance against dynamically evolving adversarial test suites.
  • Task Specialization: The levels suit different needs: Level 1 for simple tasks, Level 2 for quality assurance, and Level 4 for resilience against edge cases.
  • Visualizable AI: Agents developed for the Snake game can be visualized post-experimentation for qualitative assessment.
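To put the cost-performance figures above in perspective, the short calculation below works out the ratio between the two per-task costs and how many tasks a fixed budget would cover. The $10 budget is an arbitrary assumption chosen for illustration; only the $0.03 and $0.63 figures come from the summary.

```python
from decimal import Decimal

FEEDBACK_COST = Decimal("0.03")  # Level 2, per task (from the summary)
ARENA_COST = Decimal("0.63")     # Level 4, per task (from the summary)
BUDGET = Decimal("10")           # assumed budget in dollars, for illustration only


def tasks_within(budget: Decimal, cost_per_task: Decimal) -> int:
    """Whole number of tasks that fit in the budget."""
    return int(budget // cost_per_task)


ratio = ARENA_COST / FEEDBACK_COST

if __name__ == "__main__":
    print(ratio)                                # 21
    print(tasks_within(BUDGET, FEEDBACK_COST))  # 333
    print(tasks_within(BUDGET, ARENA_COST))     # 15
```

On these figures the Arena Loop costs 21 times as much per task as the Feedback Loop. Decimal is used instead of float so the division stays exact.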

Maintenance & Community

The provided README does not detail specific maintenance contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

License information is not specified in the provided README text.

Limitations & Caveats

Over-reliance on static benchmarks can create a false sense of security and produce brittle agents. The adversarial Arena Loop (Level 4) requires more computational resources than the earlier levels, consistent with its higher reported cost of $0.63 per task. Because the project is a research progression, it may still be under development, and its APIs may change.

Health Check
Last Commit

2 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
218 stars in the last 30 days

Explore Similar Projects

Starred by Peter Norvig (author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Vincent Weisser (cofounder of Prime Intellect), and 3 more.

dgm by jennyzzt

0.6%
2k
Self-improving agent system
Created 1 year ago
Updated 10 months ago