skillgrade by mgechev

Testing framework for AI agent skills

Created 1 month ago
361 stars

Top 77.8% on SourcePulse

Project Summary

Skillgrade provides an automated framework for evaluating the integration of AI agent skills, acting as a unit testing suite for agent capabilities. It targets developers building AI agents that need to reliably discover and utilize external tools or functions, offering a mechanism to ensure correct skill invocation and output validation. The primary benefit is enabling robust, automated quality assurance for agent-tool interactions.

How It Works

The core of Skillgrade lies in its declarative eval.yaml configuration, which defines tasks, agent workspaces, and grading criteria. It supports two primary grader types: deterministic graders execute predefined scripts (e.g., bash, Python) and parse structured JSON output for pass/fail metrics, while llm_rubric graders leverage large language models (like Gemini or Claude) to assess qualitative aspects of the agent's session transcript against specified criteria. Skillgrade orchestrates agent execution, often within Docker containers, and aggregates scores from these graders to produce a comprehensive evaluation report.
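A configuration along these lines might look like the sketch below. The field names (tasks, workspace, graders, etc.) are illustrative assumptions, not Skillgrade's documented schema; consult the eval.yaml generated by skillgrade init for the real keys.

```yaml
# Hypothetical eval.yaml sketch -- field names are illustrative,
# not the documented Skillgrade schema.
tasks:
  - name: summarize-readme
    instructions: "Use the skill to summarize README.md into summary.txt."
    workspace: ./fixtures/readme-task   # files staged into the agent's container
    graders:
      - type: deterministic
        script: graders/check_summary.py   # must print structured JSON pass/fail
      - type: llm_rubric
        rubric: "The summary accurately covers installation and usage."
```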

Quick Start & Requirements

  • Installation: npm i -g skillgrade
  • Prerequisites: Node.js 20+, Docker.
  • Setup: From your skill directory (the one containing SKILL.md), run GEMINI_API_KEY=your-key skillgrade init (ANTHROPIC_API_KEY or OPENAI_API_KEY also work), customize the generated eval.yaml, then run evaluations, e.g. GEMINI_API_KEY=your-key skillgrade --smoke. API keys are used both for agent execution and for LLM grading.
  • Output: Reports are available via CLI (skillgrade preview) or a local web UI (skillgrade preview browser).

Highlighted Details

  • Flexible eval.yaml for defining tasks, environment setup, and complex grading logic.
  • Hybrid grading approach combining precise, deterministic checks with nuanced LLM-based qualitative assessments.
  • Built-in support for CI/CD pipelines via the --ci flag, enabling automated quality gates.
  • Automatic agent detection (Gemini, Claude, Codex) based on provided API keys, with manual override options.
  • Support for file-based instructions and rubrics, allowing for more organized and extensive test definitions.
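To make the hybrid grading concrete, here is a minimal sketch of what a deterministic grader script could look like. The JSON schema (pass, score, reason) and the workspace-argument convention are assumptions for illustration, not Skillgrade's documented contract.

```python
#!/usr/bin/env python3
"""Hypothetical deterministic grader: passes if the agent produced a
non-empty output file in its workspace. Output schema is illustrative."""
import json
import pathlib
import sys


def grade(workspace: str) -> dict:
    # Check that the agent wrote a non-empty result file.
    result = pathlib.Path(workspace) / "output.txt"
    passed = result.exists() and result.stat().st_size > 0
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "output.txt present" if passed else "output.txt missing",
    }


if __name__ == "__main__":
    workspace = sys.argv[1] if len(sys.argv) > 1 else "."
    # Skillgrade-style deterministic graders report results as JSON on stdout.
    print(json.dumps(grade(workspace)))
```

A script like this would be referenced from eval.yaml and run inside the task's container, with Skillgrade parsing the printed JSON into pass/fail metrics.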

Maintenance & Community

The project is inspired by related work like SkillsBench. While specific community channels (Discord/Slack), active maintainer information, or a public roadmap are not detailed in the README, the project provides a link to install a related "skill creator" skill (npx skills add mgechev/skills-best-practices). No sponsorships or partnerships are mentioned.

Licensing & Compatibility

Skillgrade is released under the MIT License, which is highly permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

The tool requires Node.js 20+ and Docker, which may be adoption blockers for some environments. Deterministic grading scripts must correctly format JSON output, and awk is recommended over bc for arithmetic within bash scripts due to container image limitations. The effectiveness of LLM-based grading is dependent on the chosen LLM and the clarity of the provided rubric.
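The awk-over-bc recommendation can be sketched as follows; the JSON shape emitted here is illustrative, not Skillgrade's documented grader contract.

```shell
# Sketch: fractional arithmetic inside a bash grader script.
# awk handles floating point and is present in most slim container
# images, whereas bc may be missing.
passed=7
total=9
score=$(awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f", p / t }')
echo "{\"pass\": true, \"score\": $score}"
```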

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 3
  • Star History: 246 stars in the last 30 days
