skillsbench by benchflow-ai

Benchmark for AI agent skill utilization

Created 2 months ago

703 stars

Top 48.6% on SourcePulse

View on GitHub

2 Experts Love This Project

Shane Thomas

Cofounder of Mastra

Jasper Zhang

Cofounder of Hyperbolic

Project Summary

SkillsBench is a new benchmark designed to evaluate the effectiveness of AI agents in utilizing modular "skills" for complex workflows. It targets AI researchers and developers, providing a standardized framework to measure both skill performance and agent behavior, aiming to accelerate the development of more capable AI systems.

How It Works

The project employs a gym-style benchmarking methodology, treating skill utilization as an evaluable environment. It focuses on assessing how agents compose and apply multiple skills (2+) to solve tasks, a key area where current state-of-the-art models often struggle. This approach allows for systematic measurement of agent capabilities beyond single-skill execution.

Quick Start & Requirements

Installation involves uv tool install harbor. Users clone the repository and initialize tasks using harbor tasks init. Task validation and execution are performed via harbor tasks check and harbor run. A critical prerequisite is setting API keys for target models (e.g., OpenAI, Anthropic) as environment variables or via a .envrc file. Links to official documentation and community channels are available.

Highlighted Details

This project introduces the first benchmark specifically for AI agent skill evaluation. It aims for state-of-the-art performance below 50% on tasks requiring skill composition, targeting advanced models like Claude Opus 4.5 and GPT-5.2. The framework supports batch evaluation of multiple agents concurrently using YAML configuration files, facilitating large-scale testing.

Maintenance & Community

The project is actively seeking community involvement, with a Discord server and weekly sync meetings scheduled for Mondays at 5 PM PT. Contributions are welcomed via a dedicated contributing guide.

Licensing & Compatibility

SkillsBench is released under the Apache 2.0 license, which is permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

The project is explicitly marked as a Work In Progress (WIP), indicating potential for ongoing changes and incomplete features. Running task checks requires valid API keys for the models being tested.

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

400 stars in the last 30 days