skillsbench by benchflow-ai

Benchmark for AI agent skill utilization

Created 4 weeks ago


262 stars

Top 97.3% on SourcePulse

Project Summary

SkillsBench is a new benchmark designed to evaluate the effectiveness of AI agents in utilizing modular "skills" for complex workflows. It targets AI researchers and developers, providing a standardized framework to measure both skill performance and agent behavior, aiming to accelerate the development of more capable AI systems.

How It Works

The project employs a gym-style benchmarking methodology, treating skill utilization as an evaluable environment. It focuses on assessing how agents compose and apply two or more skills to solve a task, a key area where current state-of-the-art models often struggle. This approach allows systematic measurement of agent capabilities beyond single-skill execution.

Quick Start & Requirements

Installation is via uv tool install harbor. Users then clone the repository and initialize tasks with harbor tasks init; task validation and execution are performed with harbor tasks check and harbor run, respectively. A critical prerequisite is setting API keys for the target models (e.g., OpenAI, Anthropic) as environment variables or in a .envrc file. Links to official documentation and community channels are available.
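The steps above can be sketched as a shell session. Only the commands named in the quick start are used; the clone URL is assumed from the benchflow-ai/skillsbench name shown above, and the API key values are placeholders:

```shell
# Install the harbor CLI (per the project's quick start)
uv tool install harbor

# Clone the benchmark and enter it
# (URL assumed from the benchflow-ai/skillsbench name above)
git clone https://github.com/benchflow-ai/skillsbench.git
cd skillsbench

# Provide API keys for the target models, either exported directly...
export ANTHROPIC_API_KEY=...   # placeholder
export OPENAI_API_KEY=...      # placeholder
# ...or declared in a .envrc file at the repository root.

# Initialize, validate, and run tasks
harbor tasks init
harbor tasks check
harbor run
```

Which keys are actually required depends on the models selected for evaluation; harbor tasks check will fail without valid credentials for those models.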

Highlighted Details

This project introduces the first benchmark specifically targeting AI agent skill evaluation. It reports state-of-the-art performance below 50% on tasks requiring skill composition, even for advanced models such as Claude Opus 4.5 and GPT-5.2. The framework supports batch evaluation of multiple agents concurrently via YAML configuration files, facilitating large-scale testing.

Maintenance & Community

The project is actively seeking community involvement, with a Discord server and weekly sync meetings scheduled for Mondays at 5 PM PT. Contributions are welcomed via a dedicated contributing guide.

Licensing & Compatibility

SkillsBench is released under the Apache 2.0 license, which is permissive and generally compatible with commercial use and integration into closed-source projects.

Limitations & Caveats

The project is explicitly marked as a Work In Progress (WIP), indicating potential for ongoing changes and incomplete features. Running task checks requires valid API keys for the models being tested.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 601
Issues (30d): 21
Star History: 266 stars in the last 29 days
