mgechev
Testing framework for AI agent skills
Top 77.8% on SourcePulse
Skillgrade provides an automated framework for evaluating the integration of AI agent skills, acting as a unit testing suite for agent capabilities. It targets developers building AI agents that need to reliably discover and utilize external tools or functions, offering a mechanism to ensure correct skill invocation and output validation. The primary benefit is enabling robust, automated quality assurance for agent-tool interactions.
How It Works
The core of Skillgrade lies in its declarative eval.yaml configuration, which defines tasks, agent workspaces, and grading criteria. It supports two primary grader types: deterministic graders execute predefined scripts (e.g., bash, Python) and parse structured JSON output for pass/fail metrics, while llm_rubric graders leverage large language models (like Gemini or Claude) to assess qualitative aspects of the agent's session transcript against specified criteria. Skillgrade orchestrates agent execution, often within Docker containers, and aggregates scores from these graders to produce a comprehensive evaluation report.
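As a rough illustration of the deterministic grader idea described above, the Python sketch below inspects a workspace and prints a JSON verdict. The field names (`score`, `details`), the file name `output.txt`, and the "hello" check are assumptions made for this example, not Skillgrade's documented schema; the project's README is the authority on the actual output format.

```python
import json
import sys
from pathlib import Path


def grade(workspace: Path) -> dict:
    """Return a verdict for a hypothetical task: the agent must write a greeting."""
    target = workspace / "output.txt"  # hypothetical artifact the agent should create
    checks = {
        "file_exists": target.is_file(),
        "has_greeting": target.is_file() and "hello" in target.read_text().lower(),
    }
    passed = sum(checks.values())
    return {
        "score": passed / len(checks),  # assumed field: fraction of checks passed
        "details": checks,              # assumed field: per-check results
    }


if __name__ == "__main__":
    # Workspace path comes from the first argument, else the current directory.
    workspace = Path(sys.argv[1]) if len(sys.argv) > 1 else Path.cwd()
    # json.dumps guarantees well-formed output on a single line.
    print(json.dumps(grade(workspace)))
```

Building the result as a dictionary and serializing it once at the end keeps the output parseable, which is easier to get right than assembling JSON by string concatenation.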
Quick Start & Requirements
- Install: npm i -g skillgrade
- From a skill directory (one containing SKILL.md), initialize with GEMINI_API_KEY=your-key skillgrade init (or ANTHROPIC_API_KEY/OPENAI_API_KEY), customize the generated eval.yaml, and run evaluations using commands like GEMINI_API_KEY=your-key skillgrade --smoke. API keys are used for agent execution and LLM grading.
- Review results in the CLI (skillgrade preview) or a local web UI (skillgrade preview browser).
Highlighted Details
- Declarative eval.yaml for defining tasks, environment setup, and complex grading logic.
- CI support through the --ci flag, enabling automated quality gates.
Maintenance & Community
The project is inspired by related work like SkillsBench. The README does not detail community channels (Discord/Slack), active maintainers, or a public roadmap. It does link to a related "skill creator" skill, installable with npx skills add mgechev/skills-best-practices. No sponsorships or partnerships are mentioned.
Licensing & Compatibility
Skillgrade is released under the MIT License, which is highly permissive and generally compatible with commercial use and integration into closed-source projects.
Limitations & Caveats
The tool requires Node.js 20+ and Docker, which may be adoption blockers for some environments. Deterministic grading scripts must correctly format JSON output, and awk is recommended over bc for arithmetic within bash scripts due to container image limitations. The effectiveness of LLM-based grading is dependent on the chosen LLM and the clarity of the provided rubric.
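Since a deterministic grader is only useful if its output parses, it can help to check a script's stdout before wiring it into an evaluation. The sketch below is a hypothetical helper, not part of Skillgrade. It assumes the grader should print exactly one JSON object with a numeric `score` field between 0 and 1; both the field name and the range are assumptions for illustration.

```python
import json


def validate_grader_output(stdout: str) -> dict:
    """Parse grader stdout and raise ValueError if it is not usable."""
    try:
        result = json.loads(stdout)
    except json.JSONDecodeError as exc:
        # Typical cause: debug text echoed to stdout alongside the JSON.
        raise ValueError(f"grader output is not valid JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object at the top level")
    score = result.get("score")  # assumed field name
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    if not 0 <= score <= 1:  # assumed range
        raise ValueError("'score' must be between 0 and 1")
    return result


print(validate_grader_output('{"score": 0.5}'))
```

A stray log line before the JSON is enough to make `json.loads` fail, so scripts that need diagnostics are better off writing them to stderr.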