agentevals by langchain-ai

Evaluators for agent trajectories

created 5 months ago

282 stars

Top 93.5% on sourcepulse

Project Summary

This library provides evaluators for agent trajectories, helping developers understand and improve the intermediate steps LLM agents take to solve problems. It offers various evaluation methods, including LLM-as-judge and direct trajectory matching, catering to developers building complex agentic applications.

How It Works

AgentEvals offers several evaluation strategies for agent trajectories, which are sequences of messages or graph steps. Trajectory match evaluators compare an agent's output against a reference trajectory using modes like "strict," "unordered," "subset," or "superset." LLM-as-judge evaluators use a language model to score the trajectory's accuracy, efficiency, and logical progression, with options to include reference trajectories or customize prompts. Graph trajectory evaluators specifically handle agents modeled as graphs, assessing sequences of nodes and steps.

Quick Start & Requirements

Installation: pip install agentevals (Python) or npm install agentevals @langchain/core (TypeScript).
Prerequisites: For LLM-as-judge evaluators, an OpenAI API key is required and should be set as an environment variable (OPENAI_API_KEY). LangChain integrations are used by default, but direct OpenAI client usage is also supported.
Demo: The README provides detailed Python and TypeScript examples for various evaluators.

Highlighted Details

Supports flexible tool argument matching (exact, ignore, subset, superset, custom overrides).
Includes specialized evaluators for graph-based agent trajectories (e.g., from LangGraph).
Offers asynchronous support for all evaluators.
Integrates with LangSmith for experiment tracking and evaluation logging.

Maintenance & Community

The project is associated with LangChainAI and can be found on X @LangChainAI. Issues and suggestions can be raised on their GitHub repository.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking would depend on the final license.

Limitations & Caveats

The README does not specify any limitations or known issues. The LLM-as-judge evaluators rely on external LLM providers, which may introduce variability or cost.

agentevals by langchain-ai

Explore Similar Projects

saplings by shobrook

js-agent by lgrammel

taskgen by simbianai

AgentKit by Holmeswww

MLAgentBench by snap-stanford

openevals by langchain-ai

AgentGym by WooooDyy

GPTSwarm by metauto-ai

AgentLab by ServiceNow

graph_websearch_agent by john-adeojo

openai-agents-python by openai

evals by openai