agentevals by langchain-ai

Evaluators for agent trajectories

Created 10 months ago
444 stars

Top 67.4% on SourcePulse

Project Summary

This library provides evaluators for agent trajectories, helping developers understand and improve the intermediate steps LLM agents take to solve problems. It offers several evaluation methods, including LLM-as-judge scoring and direct trajectory matching, and is aimed at developers building complex agentic applications.

How It Works

AgentEvals offers several evaluation strategies for agent trajectories, which are sequences of messages or graph steps. Trajectory match evaluators compare an agent's output against a reference trajectory using modes like "strict," "unordered," "subset," or "superset." LLM-as-judge evaluators use a language model to score the trajectory's accuracy, efficiency, and logical progression, with options to include reference trajectories or customize prompts. Graph trajectory evaluators specifically handle agents modeled as graphs, assessing sequences of nodes and steps.
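
The four match modes can be illustrated with a minimal sketch. This is not the library's implementation: it compares only tool-call names, and both the function name tool_names_match and the reading of "subset" (the agent made no calls beyond the reference) and "superset" (the agent made at least the reference calls) are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Literal

MatchMode = Literal["strict", "unordered", "subset", "superset"]


def tool_names_match(outputs: List[str], reference: List[str], mode: MatchMode) -> bool:
    """Hypothetical helper: compare two lists of tool-call names under a match mode."""
    out_counts, ref_counts = Counter(outputs), Counter(reference)
    if mode == "strict":
        # Same calls in the same order.
        return outputs == reference
    if mode == "unordered":
        # Same calls, order ignored (compared as multisets).
        return out_counts == ref_counts
    if mode == "subset":
        # Assumed meaning: the agent made no calls beyond the reference.
        return all(out_counts[name] <= ref_counts[name] for name in out_counts)
    if mode == "superset":
        # Assumed meaning: the agent made at least every reference call.
        return all(ref_counts[name] <= out_counts[name] for name in ref_counts)
    raise ValueError(f"unknown match mode: {mode}")


reference = ["get_weather", "get_time"]
swapped = ["get_time", "get_weather"]
strict_ok = tool_names_match(swapped, reference, "strict")        # order differs
unordered_ok = tool_names_match(swapped, reference, "unordered")  # order ignored
```

Under these assumptions, the swapped trajectory fails in "strict" mode but passes in "unordered" mode, which is the practical difference between the two.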

Quick Start & Requirements

  • Installation: pip install agentevals (Python) or npm install agentevals @langchain/core (TypeScript).
  • Prerequisites: For LLM-as-judge evaluators, an OpenAI API key is required and should be set as an environment variable (OPENAI_API_KEY). LangChain integrations are used by default, but direct OpenAI client usage is also supported.
  • Demo: The README provides detailed Python and TypeScript examples for various evaluators.

Highlighted Details

  • Supports flexible tool argument matching (exact, ignore, subset, superset, custom overrides).
  • Includes specialized evaluators for graph-based agent trajectories (e.g., from LangGraph).
  • Offers asynchronous support for all evaluators.
  • Integrates with LangSmith for experiment tracking and evaluation logging.
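
The tool argument matching options listed above can be sketched as follows. This is an illustrative model only, not the library's code: the function name args_match, the override parameter, and the direction of "subset" and "superset" (taken here relative to the reference arguments) are assumptions.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical type for a custom comparison of (output_args, reference_args).
ArgMatcher = Callable[[Dict[str, Any], Dict[str, Any]], bool]


def args_match(
    output_args: Dict[str, Any],
    reference_args: Dict[str, Any],
    mode: str = "exact",
    override: Optional[ArgMatcher] = None,
) -> bool:
    """Hypothetical helper: compare one tool call's arguments to a reference."""
    if override is not None:
        # A custom override takes precedence over the named mode.
        return override(output_args, reference_args)
    if mode == "exact":
        return output_args == reference_args
    if mode == "ignore":
        # Arguments are not compared at all.
        return True
    if mode == "subset":
        # Assumed meaning: every output argument appears in the reference.
        return all(k in reference_args and reference_args[k] == v
                   for k, v in output_args.items())
    if mode == "superset":
        # Assumed meaning: every reference argument appears in the output.
        return all(k in output_args and output_args[k] == v
                   for k, v in reference_args.items())
    raise ValueError(f"unknown argument match mode: {mode}")


def same_city_ignoring_case(out: Dict[str, Any], ref: Dict[str, Any]) -> bool:
    """Example custom override: compare only the 'city' argument, case-insensitively."""
    return str(out.get("city", "")).lower() == str(ref.get("city", "")).lower()


reference_args = {"city": "San Francisco", "units": "metric"}
loose_ok = args_match({"city": "san francisco"}, reference_args,
                      override=same_city_ignoring_case)
```

A custom override is useful when exact equality is too strict for a particular tool, for example when the agent may vary casing or include optional arguments.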

Maintenance & Community

The project is published by langchain-ai, which can be found on X as @LangChainAI. Issues and suggestions can be raised on the project's GitHub repository.

Licensing & Compatibility

The provided README does not explicitly state a license. Compatibility for commercial use or closed-source linking therefore depends on the license published in the repository itself.

Limitations & Caveats

The README does not specify any limitations or known issues. The LLM-as-judge evaluators rely on external LLM providers, which may introduce variability or cost.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 26 stars in the last 30 days
