benchflow-ai: Benchmark for AI agent skill utilization
SkillsBench is a new benchmark that evaluates how effectively AI agents use modular "skills" in complex workflows. Aimed at AI researchers and developers, it provides a standardized framework for measuring both skill performance and agent behavior, with the goal of accelerating the development of more capable AI systems.
How It Works
The project employs a gym-style benchmarking methodology, treating skill utilization as an evaluable environment. It focuses on assessing how agents compose and apply multiple skills (two or more) to solve a task, a regime where current state-of-the-art models often struggle. This allows systematic measurement of agent capabilities beyond single-skill execution.
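To make the gym-style framing concrete, here is a minimal, purely illustrative Python sketch of such an evaluation loop. Nothing below is SkillsBench's actual API; it only shows the pattern of treating multi-skill task solving as a resettable, scoreable environment.

  # Illustrative sketch only: multi-skill task solving as a gym-style
  # environment. All names are generic assumptions, not SkillsBench's API.

  class ToySkillEnv:
      """A toy task solved only by composing two skills in order:
      first 'parse' the input, then 'summarize' the parsed result."""
      def reset(self):
          self.state = "raw"
          return self.state                  # initial observation

      def step(self, skill):
          # Applying the right skill for the current state advances the task.
          if self.state == "raw" and skill == "parse":
              self.state = "parsed"
          elif self.state == "parsed" and skill == "summarize":
              self.state = "done"
          return self.state, self.state == "done"

      def score(self):
          return 1.0 if self.state == "done" else 0.0

  def evaluate(agent, env, max_steps=10):
      """Generic gym-style loop: reset, let the agent pick skills, score."""
      obs = env.reset()
      for _ in range(max_steps):
          obs, done = env.step(agent(obs))
          if done:
              break
      return env.score()

  # An "agent" here is just a policy mapping observations to skill choices.
  scripted_agent = {"raw": "parse", "parsed": "summarize"}.get
  print(evaluate(scripted_agent, ToySkillEnv()))  # -> 1.0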
Quick Start & Requirements
Install the harbor CLI with uv tool install harbor, then clone the repository and initialize tasks with harbor tasks init. Task validation and execution are performed via harbor tasks check and harbor run. A critical prerequisite is setting API keys for the target models (e.g., OpenAI, Anthropic) as environment variables or via a .envrc file. Links to official documentation and community channels are available.
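Assembled from the commands above, a typical first session might look like the following. The repository URL and the API-key variable names are assumptions for illustration (standard OpenAI and Anthropic conventions), not taken from the project's docs.

  # Install the harbor CLI
  uv tool install harbor

  # Clone the repository (URL assumed for illustration)
  git clone https://github.com/benchflow-ai/SkillsBench.git
  cd SkillsBench

  # Credentials for the models under test (variable names assumed;
  # a .envrc file works as well)
  export OPENAI_API_KEY=sk-...
  export ANTHROPIC_API_KEY=sk-ant-...

  # Initialize, validate, and run tasks
  harbor tasks init
  harbor tasks check
  harbor run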
Highlighted Details
This project introduces the first benchmark specifically for evaluating AI agents' use of skills. It is designed to be hard enough that state-of-the-art models such as Claude Opus 4.5 and GPT-5.2 score below 50% on tasks requiring skill composition. The framework supports concurrent batch evaluation of multiple agents via YAML configuration files (sketched below), facilitating large-scale testing.
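The batch-evaluation schema is not documented in this summary, but a multi-agent YAML configuration might look roughly like the sketch below; every field name is a hypothetical assumption, not harbor's actual schema.

  # Hypothetical batch config; all field names are assumptions.
  agents:
    - name: claude-opus-4.5
      provider: anthropic
    - name: gpt-5.2
      provider: openai
  tasks: all          # run every initialized task
  concurrency: 4      # evaluate agents in parallel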
Maintenance & Community
The project is actively seeking community involvement, with a Discord server and weekly sync meetings scheduled for Mondays at 5 PM PT. Contributions are welcomed via a dedicated contributing guide.
Licensing & Compatibility
SkillsBench is released under the Apache 2.0 license, which is permissive and generally compatible with commercial use and integration into closed-source projects.
Limitations & Caveats
The project is explicitly marked as a Work In Progress (WIP), so expect ongoing changes and incomplete features. Running task checks requires valid API keys for the models under test.