AgentEval by canwhite

Agent evaluation and debugging toolkit

Created 4 weeks ago

291 stars

Top 90.3% on SourcePulse

Project Summary

A transparent HTTP proxy for evaluating AI agents, AgentEval captures and structures agent-LLM API traffic. It automates conversation splitting, multi-dimensional grading, rule-based behavioral diagnosis, and LLM-driven configuration probing, offering insights via a web dashboard. This tool is designed for developers and researchers seeking to objectively assess and debug AI agent performance.

How It Works

AgentEval acts as an HTTP proxy, intercepting and logging all agent-LLM API communications. It automatically detects session boundaries using message rollback or idle timeouts, generating structured conversation views. The system then applies automated grading across four dimensions (task completion, tool efficiency, response quality, performance) using rule-based metrics and an LLM judge. Behavioral issues are diagnosed via a 10-rule engine, and an LLM probe, equipped with file access tools, analyzes the agent's source configuration for root causes. Results are presented through a local web dashboard.
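The session-boundary step above can be illustrated with a minimal sketch. It assumes that "message rollback" means a request whose conversation history is shorter than the previous request's, and that an idle timeout is a fixed gap between consecutive requests. The `CapturedRequest` type, the function names, and the timeout value are hypothetical and are not taken from the AgentEval source.

```rust
use std::time::{Duration, SystemTime};

/// One captured agent-to-LLM request (hypothetical shape, for illustration only).
#[derive(Debug, Clone)]
pub struct CapturedRequest {
    /// When the proxy received the request.
    pub received_at: SystemTime,
    /// Number of messages in the request's conversation history.
    pub message_count: usize,
}

/// Returns true when `next` should open a new session after `prev`.
pub fn starts_new_session(
    prev: &CapturedRequest,
    next: &CapturedRequest,
    idle_timeout: Duration,
) -> bool {
    // Assumed rollback rule: the history got shorter, so the agent started over.
    let rolled_back = next.message_count < prev.message_count;
    // Idle rule: the gap between requests exceeds the timeout.
    // If the clock went backwards, `duration_since` fails; treat that as "not idle".
    let idle = next
        .received_at
        .duration_since(prev.received_at)
        .map(|gap| gap > idle_timeout)
        .unwrap_or(false);
    rolled_back || idle
}

/// Groups request indices into sessions, in arrival order.
pub fn split_sessions(requests: &[CapturedRequest], idle_timeout: Duration) -> Vec<Vec<usize>> {
    let mut sessions: Vec<Vec<usize>> = Vec::new();
    for (i, req) in requests.iter().enumerate() {
        // The first request always opens a session.
        let new_session = match i.checked_sub(1) {
            Some(p) => starts_new_session(&requests[p], req, idle_timeout),
            None => true,
        };
        if new_session {
            sessions.push(vec![i]);
        } else if let Some(last) = sessions.last_mut() {
            last.push(i);
        }
    }
    sessions
}
```

Calling `split_sessions(&requests, Duration::from_secs(300))` on a chronologically ordered slice returns one index list per detected session. The real tool may use different or additional signals.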

Quick Start & Requirements

Installation: Build and start the proxy from the Rust source with cargo run.
Configuration: Set environment variables or a .env file for AGENTEVAL_UPSTREAM (LLM API), AGENTEVAL_PORT, AGENTEVAL_JUDGE_API_BASE, AGENTEVAL_JUDGE_MODEL, AGENTEVAL_JUDGE_API_KEY, and PROBE_SOURCE_PROJECT_DIR.
Agent Setup: Redirect your agent's BASE_URL to the AgentEval proxy address (e.g., http://127.0.0.1:57633).
Prerequisites: Rust toolchain (cargo), LLM API access for evaluation features.
Dashboard: Once the proxy is running, the web dashboard is available at http://127.0.0.1:57633/dashboard/.
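To make the configuration step concrete, here is a hedged sketch of reading the variables listed above with the Rust standard library. The variable names come from the summary. The `Settings` struct, the `load_settings` function, the choice to treat only `AGENTEVAL_UPSTREAM` as required, and the fallback port (copied from the example address) are assumptions for illustration, not the project's actual behavior.

```rust
use std::env;

/// Fallback port, assumed from the example address in the summary.
pub const DEFAULT_PORT: u16 = 57633;

/// Parsed proxy settings (hypothetical struct, not AgentEval's own type).
#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub upstream: String,
    pub port: u16,
    // Judge and probe settings are modeled as optional here because the
    // summary says LLM-based features can be skipped when unavailable.
    pub judge_api_base: Option<String>,
    pub judge_model: Option<String>,
    pub judge_api_key: Option<String>,
    pub probe_source_project_dir: Option<String>,
}

/// Builds `Settings` from any key lookup, which keeps the logic testable
/// without touching the process environment.
pub fn load_settings<F>(lookup: F) -> Result<Settings, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: the upstream LLM API address is mandatory.
    let upstream = lookup("AGENTEVAL_UPSTREAM")
        .ok_or_else(|| "AGENTEVAL_UPSTREAM is not set".to_string())?;
    let port = match lookup("AGENTEVAL_PORT") {
        Some(raw) => raw
            .trim()
            .parse::<u16>()
            .map_err(|e| format!("invalid AGENTEVAL_PORT {:?}: {}", raw, e))?,
        None => DEFAULT_PORT,
    };
    Ok(Settings {
        upstream,
        port,
        judge_api_base: lookup("AGENTEVAL_JUDGE_API_BASE"),
        judge_model: lookup("AGENTEVAL_JUDGE_MODEL"),
        judge_api_key: lookup("AGENTEVAL_JUDGE_API_KEY"),
        probe_source_project_dir: lookup("PROBE_SOURCE_PROJECT_DIR"),
    })
}

/// Reads the same settings from the real process environment.
pub fn load_settings_from_env() -> Result<Settings, String> {
    load_settings(|key| env::var(key).ok())
}
```

With settings like these in place, the agent's BASE_URL would point at `http://127.0.0.1:<port>` so that its LLM traffic passes through the proxy.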

Highlighted Details

Comprehensive Traffic Analysis: Captures raw agent-LLM API traffic, logs it, and structures conversations.
Automated Evaluation Pipeline: Features session splitting, multi-dimensional auto-grading (task completion, tool efficiency, response quality, performance), rule-based diagnosis, and LLM-driven configuration probing.
Web Dashboard Interface: Provides a UI for session lists, detailed grading, diagnosis summaries, and probe findings.
Configurable LLM Judge: Allows specification of LLM endpoints and models for grading, diagnosis summarization, and probing.
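As an illustration of the four grading dimensions named above, the following sketch models a per-session grade record. The `Grades` type, the 0.0 to 1.0 score range, and the equal-weight average over available dimensions are assumptions. The summary does not say how AgentEval combines scores, only that LLM-based grading may be skipped.

```rust
/// Per-session scores for the four dimensions listed in the summary.
/// Each score is assumed to lie in 0.0..=1.0. `None` means the dimension
/// was not graded, for example because the LLM judge was unavailable.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Grades {
    pub task_completion: Option<f64>,
    pub tool_efficiency: Option<f64>,
    pub response_quality: Option<f64>,
    pub performance: Option<f64>,
}

impl Grades {
    /// Equal-weight mean over the dimensions that were actually graded
    /// (a hypothetical aggregation rule). Returns `None` if nothing was graded.
    pub fn overall(&self) -> Option<f64> {
        let scores: Vec<f64> = [
            self.task_completion,
            self.tool_efficiency,
            self.response_quality,
            self.performance,
        ]
        .iter()
        .flatten()
        .copied()
        .collect();
        if scores.is_empty() {
            None
        } else {
            Some(scores.iter().sum::<f64>() / scores.len() as f64)
        }
    }
}
```

Keeping each dimension optional lets a record hold rule-based scores even when judge-based scores are missing.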

Maintenance & Community

The provided README does not detail community channels, contributors, sponsorships, or a roadmap.

Licensing & Compatibility

The README does not specify a license, so compatibility with commercial use or closed-source linking cannot be determined.

Limitations & Caveats

The probe feature includes safety mechanisms like path sandboxing and read-only tools to prevent unintended modifications. LLM-based grading and diagnosis summarization are best-effort and may be skipped if LLM API access is unavailable. The probe requires explicit configuration of the agent's source project directory.
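Path sandboxing of the kind mentioned above can be sketched as a check that every path a tool requests stays under the configured project directory. This is a lexical check only, under the assumption that requests are relative paths. The function name `resolve_in_sandbox` is hypothetical, and AgentEval's actual mechanism is not described in the summary.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `requested` onto `root` and returns `None` if the result would
/// leave `root`. Purely lexical: it does not touch the filesystem, so it
/// does not catch symlinks that point outside the sandbox.
pub fn resolve_in_sandbox(root: &Path, requested: &Path) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    // How many components below `root` we currently are.
    let mut depth = 0usize;
    for component in requested.components() {
        match component {
            Component::CurDir => {}
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::ParentDir => {
                // Going above the sandbox root is rejected.
                if depth == 0 {
                    return None;
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and drive prefixes are rejected outright.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}
```

For example, with a root of `/srv/agent`, a request for `config/agent.yaml` resolves inside the sandbox, while `../etc/passwd` and `/etc/passwd` are both rejected. A stricter variant would also canonicalize the result and re-check it against the root to account for symlinks.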
