agent-as-a-judge by metauto-ai

Agent-as-a-Judge framework for agentic system evaluation

Created 11 months ago · 629 stars · Top 52.7% on SourcePulse

Project Summary

This project introduces the Agent-as-a-Judge framework, a novel approach to evaluating complex agentic systems. It addresses the limitations of traditional outcome-focused and labor-intensive evaluation methods by providing automated, step-by-step feedback, enabling scalable self-improvement for AI agents. The target audience includes researchers and developers working with advanced AI agents, particularly in coding and development tasks.

How It Works

Agent-as-a-Judge leverages a "judge" agent to evaluate the performance of other agents. This judge agent provides continuous, granular feedback during or after task execution, acting as a reward signal. This approach is advantageous as it automates a typically manual and time-consuming process, significantly reducing evaluation time and cost while offering richer, actionable insights for agent training.
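
The pattern is straightforward to sketch. Below is a minimal, hypothetical illustration of a judge verdict on a single requirement, using LiteLLM (which the project relies on for model compatibility); the prompt wording, function name, and model choice are assumptions, not the repository's actual interface.

```python
# Illustrative sketch of the judge pattern; prompt wording, function name,
# and model are hypothetical, not the repository's actual API.
from litellm import completion

JUDGE_SYSTEM_PROMPT = (
    "You are a judge agent. Given one requirement and evidence from the "
    "developer agent's workspace, reply SATISFIED or UNSATISFIED, then "
    "give a one-sentence justification."
)

def judge_requirement(requirement: str, evidence: str, model: str = "gpt-4o") -> str:
    """Ask the judge model for a granular verdict on a single requirement."""
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Requirement: {requirement}\n\nEvidence:\n{evidence}"},
        ],
    )
    return response.choices[0].message.content
```

Run over every requirement of a task, such verdicts yield the granular, step-by-step feedback described above rather than a single pass/fail outcome.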

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (python=3.11), activate it, and install dependencies using poetry install.
  • Prerequisites: Python 3.11, Poetry, and API keys for the supported LLMs, configured via a .env file; LiteLLM is used for provider compatibility (see the sketch after this list).
  • Usage: Examples provided for general workspace queries (scripts/run_ask.py) and for evaluating development agents (scripts/run_aaaj.py).
  • Resources: Requires LLM API access. Setup time is minimal after environment setup.
  • Links: DevAI Dataset on Hugging Face, Paper
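
A minimal sketch of the key-loading step referenced in the prerequisites, assuming python-dotenv and an OPENAI_API_KEY entry in the .env file; the repository's actual configuration variables may differ:

```python
# Minimal sketch, assuming python-dotenv and an OPENAI_API_KEY entry in
# .env; LiteLLM reads provider keys from the environment automatically.
from dotenv import load_dotenv
from litellm import completion

load_dotenv()  # populate os.environ from the .env file

# One-off smoke test that the key is picked up; model name is illustrative.
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```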

Highlighted Details

  • Claims 97.72% time and 97.64% cost savings compared to human experts.
  • Provides continuous reward signals for agent training (see the sketch after this list).
  • Applied to the DevAI benchmark, consisting of 55 realistic AI development tasks.
  • Outperforms traditional evaluation methods in delivering reliable reward signals.
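
How judge verdicts become a training signal is easy to picture: aggregate per-requirement verdicts into a scalar reward. The helper below is a hypothetical illustration, not part of the repository:

```python
# Hypothetical helper: collapse per-requirement judge verdicts into a
# scalar reward for agent training (not the repository's actual API).
def reward_from_verdicts(verdicts: list[bool]) -> float:
    """Fraction of requirements the judge marked SATISFIED."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# e.g. the judge checked five DevAI-style requirements and four passed:
print(reward_from_verdicts([True, True, True, True, False]))  # 0.8
```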

Maintenance & Community

The project is maintained by the metauto-ai organization; the associated paper's authors are affiliated with institutions including Meta AI and KAUST. Community engagement channels (e.g., Discord or Slack) are not mentioned in the README.

Licensing & Compatibility

The repository's license is not specified in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The project is presented as a proof-of-concept, primarily demonstrated on code generation tasks using the DevAI benchmark. Its performance and applicability to other agent types or domains may require further validation.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (coauthor of SGLang), Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research), and 3 more.

Trace by microsoft

0.5% · 645 stars
AutoDiff-like tool for end-to-end AI agent training with general feedback
Created 1 year ago · Updated 1 month ago
Starred by Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research) and Will Brown (Research Lead at Prime Intellect).

agent-lightning by microsoft

6.0% · 2k stars
Train any AI agent with rollouts and feedback
Created 3 months ago · Updated 2 days ago