self-improving-agents by BetterForAll

Evolving AI agents via adversarial and self-improvement loops

Created 2 months ago

266 stars

Top 96.2% on SourcePulse

Project Summary

Self-Improving Agents -- A Progression is a structured framework for building AI agents that improve themselves. Aimed at researchers and power users, it advances through four levels, ending with adversarial arenas, with the goal of producing agents that are more robust and adaptable than those tuned only against static benchmarks.

How It Works

The project introduces four levels of self-improving code agents:
Level 1, AutoResearch: A basic loop in which an LLM iteratively refines code against a benchmark; verifiable rewards let it run autonomously.
Level 2, Feedback Loop: Adds a reviewer agent that gives structured explanations of why code failed, improving cost-performance through asymmetric information.
Level 3, HyperAgent Loop: Lets agents rewrite their own source code, with a multi-stage validation process for safety and stability.
Level 4, Arena Loop: Uses adversarial co-evolution, pitting code agents against evolving test agents in an "arms race" that builds robustness against unforeseen edge cases. Level 4b adds tournament-style competition for strategy evolution.
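The Level 1 loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's code: the names score, refine, v0, v1, v2, and BENCHMARK are hypothetical, and a fixed list of hand-written candidates stands in for the revisions an LLM would propose.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Case = Tuple[str, bool]              # (input string, expected validity)
Validator = Callable[[str], bool]


def score(candidate: Validator, cases: List[Case]) -> float:
    """Verifiable reward: fraction of benchmark cases the candidate gets right."""
    return sum(candidate(s) == expected for s, expected in cases) / len(cases)


def refine(proposals: Iterable[Validator],
           benchmark: List[Case]) -> Tuple[Optional[Validator], float]:
    """Score each proposed revision and keep the best one seen so far."""
    best, best_score = None, -1.0
    for candidate in proposals:      # assumption: an LLM would generate these
        s = score(candidate, benchmark)
        if s > best_score:
            best, best_score = candidate, s
        if best_score == 1.0:        # benchmark solved, stop early
            break
    return best, best_score


# Hypothetical revisions of an email validator, from crude to stricter.
def v0(s: str) -> bool:
    return True


def v1(s: str) -> bool:
    return "@" in s


def v2(s: str) -> bool:
    local, sep, domain = s.partition("@")
    return bool(local) and bool(sep) and "." in domain


# Hypothetical fixed benchmark for the toy task.
BENCHMARK: List[Case] = [
    ("a@b.com", True),
    ("plain", False),
    ("@b.com", False),
    ("a@b", False),
]

best, best_score = refine([v0, v1, v2], BENCHMARK)
```

Because the reward is computed by running the candidate against the benchmark, the loop needs no human judgment to decide which revision to keep, which is what allows it to run unattended.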

Quick Start & Requirements

Install the dependencies with pip install -r requirements.txt. A GEMINI_API_KEY from Google AI Studio is required. Run an individual level with its run.py script (e.g., python autoresearch/run.py), and select a task such as 'snake', 'support', or 'email_validation' with the --task argument. run_all.py runs the full set of experiments, and analyze_results.py performs the cross-level analysis.
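For scripted runs, the invocation described above can be assembled from Python. This is a minimal sketch: only the script path autoresearch/run.py, the --task argument, and the GEMINI_API_KEY variable come from the text, while the helper names has_api_key and build_command are hypothetical.

```python
import os
import sys
from typing import List, Mapping


def has_api_key(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if GEMINI_API_KEY is set to a non-empty value."""
    return bool(env.get("GEMINI_API_KEY"))


def build_command(script: str, task: str) -> List[str]:
    """Build the argument list for one level's run.py with a --task value."""
    # sys.executable reuses the interpreter that is running this helper.
    return [sys.executable, script, "--task", task]


command = build_command("autoresearch/run.py", "snake")
```

The resulting list can be passed to subprocess.run(command, check=True) once has_api_key() returns True.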

Highlighted Details

Benchmark Brittleness: Fixed benchmarks can lead to brittle agent performance; Levels 1-3 demonstrated significant score degradation on adversarial tests compared to their original benchmarks.
Cost-Performance: The Feedback Loop (Level 2) offers a better cost-performance tradeoff than the Arena Loop (Level 4): $0.03 per task versus $0.63 per task, about a 21-fold difference in cost.
Robustness Development: The Arena Loop (Level 4) effectively builds robustness, showing improved performance against dynamically evolving adversarial test suites.
Task Specialization: Each level is suited for different needs: Level 1 for simple tasks, Level 2 for quality assurance, and Level 4 for resilience against edge cases.
Visualizable AI: Agents developed for the Snake game can be visualized post-experimentation for qualitative assessment.
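The brittleness and robustness points above can be illustrated with a toy arena round. This is a sketch under stated assumptions, not the project's implementation: the candidates, the case pool, and the function names are hypothetical, and the scores it produces are toy values rather than project results.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, bool]              # (input string, expected validity)
Validator = Callable[[str], bool]


def score(candidate: Validator, cases: List[Case]) -> float:
    """Fraction of cases the candidate gets right."""
    return sum(candidate(s) == expected for s, expected in cases) / len(cases)


# Hypothetical code-agent candidates.
def naive(s: str) -> bool:
    return "@" in s


def stricter(s: str) -> bool:
    local, sep, domain = s.partition("@")
    return bool(local) and bool(sep) and "." in domain and " " not in s


# Hypothetical fixed benchmark, which both candidates pass.
FIXED: List[Case] = [("a@b.com", True), ("plain", False)]

# Hypothetical pool that the test-agent stand-in draws from.
POOL: List[Case] = [
    ("@b.com", False),
    ("a@b", False),
    ("a b@c.com", False),
    ("x.y@z.org", True),
]


def find_breaking_cases(candidate: Validator, pool: List[Case]) -> List[Case]:
    """Test-agent stand-in: keep the cases the current champion gets wrong."""
    return [(s, e) for s, e in pool if candidate(s) != e]


def arena(candidates: List[Validator], suite: List[Case],
          pool: List[Case], rounds: int = 3) -> Tuple[Validator, List[Case]]:
    """Alternate between growing the test suite and re-selecting the champion."""
    suite = list(suite)
    champion = max(candidates, key=lambda c: score(c, suite))
    for _ in range(rounds):
        breaking = [c for c in find_breaking_cases(champion, pool)
                    if c not in suite]
        if not breaking:             # the test side found nothing new
            break
        suite.extend(breaking)
        champion = max(candidates, key=lambda c: score(c, suite))
    return champion, suite


champion, final_suite = arena([naive, stricter], FIXED, POOL)
```

In this toy run, naive is perfect on the fixed benchmark but fails every case the test side adds, so the champion switches to stricter once the suite grows. The same pattern, at much larger scale and with LLM-driven agents on both sides, is what the Arena Loop is described as exploiting.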

Maintenance & Community

The provided README does not detail specific maintenance contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

License information is not specified in the provided README text.

Limitations & Caveats

Over-reliance on static benchmarks can create a false sense of security and produce brittle agents. Level 4's adversarial loop requires more computational resources than the earlier levels. As a research progression, the project may still be under active development, and its APIs may change.
