agent-skills-eval by darkrishabh

Test runner for AI agent skills evaluation

Created 2 months ago

618 stars

Top 52.6% on SourcePulse

1 Expert Loves This Project

Dan Guido

Cofounder of Trail of Bits

Project Summary

This tool provides an empirical evaluation framework for AI agent skills, specifically those adhering to the agentskills.io standard. It addresses the challenge of objectively proving whether integrating a custom skill improves an agent's performance on a given task. The tool is designed for developers and researchers working with AI agents who need to validate skill effectiveness before deployment, offering a clear, data-driven approach to performance assessment.

How It Works

The core mechanism involves running the same set of evaluation prompts twice: once with the agent skill loaded into the context (with_skill) and once without it (without_skill), establishing a baseline. A designated "judge" LLM then grades the outputs from both runs against predefined assertions and expected outcomes. This comparative analysis, coupled with judge-based scoring, provides empirical evidence of the skill's impact, or lack thereof.
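The paired-run idea can be sketched in a few lines. This is an illustrative sketch only, not the agent-skills-eval API: the types (`EvalCase`, `Generate`, `Judge`), the functions (`passRate`, `runPaired`), and the stub model and judge are all hypothetical names chosen for this example. A real setup would call LLM APIs asynchronously; the stubs here are synchronous so the comparison logic stays easy to follow.

```typescript
// Hypothetical types for illustration; these are NOT the package's real API.
type EvalCase = { prompt: string; assertions: string[] };
type Generate = (prompt: string, skill: string | null) => string;
type Judge = (output: string, assertion: string) => boolean;
type PairedResult = { withSkill: number; withoutSkill: number; lift: number };

// Fraction of assertions the judge marks as passed for one configuration.
function passRate(
  cases: EvalCase[],
  skill: string | null,
  generate: Generate,
  judge: Judge,
): number {
  let passed = 0;
  let total = 0;
  for (const evalCase of cases) {
    const output = generate(evalCase.prompt, skill);
    for (const assertion of evalCase.assertions) {
      total += 1;
      if (judge(output, assertion)) passed += 1;
    }
  }
  return total === 0 ? 0 : passed / total;
}

// Run the same cases with the skill and without it (the baseline).
function runPaired(
  cases: EvalCase[],
  skill: string,
  generate: Generate,
  judge: Judge,
): PairedResult {
  const withSkill = passRate(cases, skill, generate, judge);
  const withoutSkill = passRate(cases, null, generate, judge);
  return { withSkill, withoutSkill, lift: withSkill - withoutSkill };
}

// Stub "model": appends a marker when a skill is loaded (assumption for the demo).
const stubGenerate: Generate = (prompt, skill) =>
  skill ? `${prompt} [formatted as JSON]` : prompt;

// Stub "judge": a substring check standing in for an LLM grader.
const stubJudge: Judge = (output, assertion) => output.includes(assertion);

const demoCases: EvalCase[] = [
  { prompt: "Summarize the release notes", assertions: ["JSON"] },
  { prompt: "List open issues", assertions: ["JSON", "issues"] },
];

const demoResult = runPaired(demoCases, "Always answer as JSON.", stubGenerate, stubJudge);
console.log(demoResult);
```

A positive `lift` means the skill raised the pass rate over the baseline; a value near zero suggests the skill made no measurable difference on these cases.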

Quick Start & Requirements

Primary install/run command: install with npm install agent-skills-eval, or run directly with npx agent-skills-eval.
Prerequisites: Node.js environment. Requires API keys and access to LLM APIs (e.g., OpenAI, Together, Groq, Anthropic via compatible layers, or local servers) for both the target model performing the task and the judge model. API keys are typically configured via environment variables (e.g., OPENAI_API_KEY).
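As a rough illustration of the environment-variable setup, the sketch below resolves separate configurations for the target model and the judge model against an OpenAI-compatible endpoint. `ModelConfig`, `resolveModelConfig`, the model names, and the local server URL are hypothetical placeholders, not options defined by agent-skills-eval; only the `OPENAI_API_KEY` variable name comes from the text above.

```typescript
// Hypothetical config shape; not the tool's actual configuration format.
type ModelConfig = { baseUrl: string; apiKey: string; model: string };

// Build a config from an environment map, failing fast if the key is missing.
// The env is passed in explicitly so the function is easy to test;
// in a Node.js script you would pass process.env.
function resolveModelConfig(
  env: Record<string, string | undefined>,
  model: string,
  baseUrl: string = "https://api.openai.com/v1",
): ModelConfig {
  const apiKey = env["OPENAI_API_KEY"];
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }
  return { baseUrl, apiKey, model };
}

// Placeholder values for the demo; substitute real model names and keys.
const exampleEnv = { OPENAI_API_KEY: "sk-example" };
const targetConfig = resolveModelConfig(exampleEnv, "target-model-name");
// A local OpenAI-compatible server would only differ in its base URL (placeholder).
const judgeConfig = resolveModelConfig(exampleEnv, "judge-model-name", "http://localhost:8000/v1");
console.log(targetConfig.baseUrl, judgeConfig.baseUrl);
```

Keeping the target and judge configurations separate makes it straightforward to grade a small local model with a stronger hosted judge, or the reverse.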
Links:
- Documentation: https://darkrishabh.github.io/agent-skills-eval/
- Agent Skills Standard: https://agentskills.io

Highlighted Details

- Empirical Validation: Directly compares with_skill versus without_skill runs to quantify performance lift.
- Judge-Based Grading: Leverages any chat model as a judge for objective pass/fail assessments with cited assertions.
- TypeScript SDK & CLI: Offers both a convenient command-line interface for CI/CD and a comprehensive SDK for custom integration into pipelines or dashboards.
- Broad Model Compatibility: Works out-of-the-box with OpenAI-compatible APIs, supporting various providers and local model servers.
- Tool-Call Assertions: Includes deterministic checks for agents that interact with tools, not just generate text.
- Portable Artifacts & Reports: Generates standardized JSON/JSONL artifacts and self-contained static HTML reports for easy sharing and analysis.
- Full agentskills.io Spec Compliance: Adheres strictly to the agentskills.io specification for SKILL.md, evals.json, and artifact layout.
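To make the tool-call assertions item concrete, here is one way a deterministic check could look. It needs no judge model: it simply inspects the recorded calls. The `ToolCall` and `ToolCallAssertion` shapes and the `checkToolCall` function are assumptions made for this sketch and may not match the format the tool or the agentskills.io spec actually uses.

```typescript
// Hypothetical shapes for a recorded tool call and an expected call.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolCallAssertion = { tool: string; args?: Record<string, unknown> };

// Passes if any recorded call has the expected tool name and
// contains every expected argument with an equal (JSON-serialized) value.
function checkToolCall(calls: ToolCall[], expected: ToolCallAssertion): boolean {
  return calls.some((call) => {
    if (call.name !== expected.tool) return false;
    const wanted = expected.args ?? {};
    return Object.entries(wanted).every(
      ([key, value]) => JSON.stringify(call.args[key]) === JSON.stringify(value),
    );
  });
}

// Example trace from an agent run (made-up data).
const recordedCalls: ToolCall[] = [
  { name: "search", args: { query: "weather", limit: 5 } },
];

console.log(checkToolCall(recordedCalls, { tool: "search", args: { query: "weather" } }));
console.log(checkToolCall(recordedCalls, { tool: "delete" }));
```

Because the check is a pure function of the recorded calls, it gives the same verdict on every run, which is what distinguishes it from judge-based grading.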

Maintenance & Community

The project is actively maintained, with contributions welcomed via GitHub issues and pull requests. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

License: MIT.
Compatibility: The MIT license permits broad use, including commercial applications and linking within closed-source projects. The tool is designed to be compatible with any LLM provider exposing an OpenAI-compatible API.

Limitations & Caveats

The evaluation is only as reliable as the judge model and only as thorough as the defined assertions and expected outputs. Because the agentskills.io standard is relatively new, the ecosystem is still evolving, so the specification may change and early adopters may face challenges.

Health Check

Last Commit

3 days ago

Responsiveness

Inactive

Star History

36 stars in the last 30 days