myclaw-bench  by LeoYeAI

Benchmark for AI agents on OpenClaw

Created 1 month ago
281 stars

Top 92.6% on SourcePulse

View on GitHub
Project Summary

MyClaw Bench addresses the inadequacy of AI agent benchmarks that prioritize format compliance over actual task completion. It provides a definitive evaluation suite for AI agents on OpenClaw, targeting engineers and researchers. The benefit is a clear understanding of which LLMs excel in real-world agent scenarios, focusing on outcome, reasoning, safety, and efficiency.

How It Works

This benchmark employs a multi-dimensional scoring system across 45 tasks spanning four difficulty tiers: Foundation, Reasoning, Mastery, and Frontier (including Computer Use). Its core innovation lies in semantic grading, which assesses task outcomes through proper parsing and semantic correctness rather than brittle regex matching. The approach prioritizes real-world applicability by evaluating reasoning, safety, efficiency, and resilience, moving beyond simple success rates.
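The contrast between semantic grading and regex matching can be sketched as follows. This is an illustrative example only, not the benchmark's actual grader: the function name, JSON output format, and expected-answer shape are all assumptions made for the sketch.

```python
import json

def semantic_grade(agent_output: str, expected: dict) -> bool:
    """Hypothetical semantic grader: parse the agent's output and compare
    values, rather than matching the raw string with a brittle regex."""
    try:
        parsed = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    # Key order, whitespace, and formatting no longer matter; only the
    # semantic content of the answer does.
    return all(parsed.get(k) == v for k, v in expected.items())

# Both outputs are semantically correct, though a regex pinned to one
# exact string would reject the second.
assert semantic_grade('{"answer": 42, "unit": "files"}', {"answer": 42})
assert semantic_grade('{ "unit": "files", "answer": 42 }', {"answer": 42})
assert not semantic_grade('{"answer": 41}', {"answer": 42})
```

The point of the sketch: a grader that parses outputs tolerates formatting variation that would break exact-match checks, which is what the summary means by assessing outcomes "through proper parsing and semantic correctness."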

Quick Start & Requirements

Installation involves cloning the repository. Execution is script-driven, with primary commands like ./scripts/run.sh --model <MODEL_ID> for all tasks or --tier <TIER_NAME> for specific difficulty levels. Requirements include Python 3.10+, the uv package manager, a running OpenClaw instance, and an API key for the tested model. Official links: Leaderboard (bench.myclaw.ai), OpenClaw (github.com/openclaw/openclaw).

Highlighted Details

  • Features 45 tasks across Foundation, Reasoning, Mastery, and Frontier tiers, designed to differentiate AI agent capabilities.
  • Evaluates agents across Outcome, Reasoning, Safety, Efficiency, Resilience, and Consistency dimensions.
  • Scoring is multi-faceted: Success (35%), Efficiency (15%), Safety (20%), Consistency (10%), and Frontier intelligence (20%).
  • Frontier tasks specifically target advanced capabilities like metacognition, inductive reasoning, and calibrated safety judgment.
  • Computer Use tasks necessitate real browser interaction, creating a significant performance gap for models lacking this functionality.
  • Benchmark data is derived from over 10,000 real agent sessions.

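The scoring weights above imply a simple weighted composite. The sketch below uses only the published percentages; the function name, the 0–1 score scale, and the assumption that dimensions combine as a weighted sum are illustrative, not the repo's actual API.

```python
# Published weights from the summary: Success 35%, Efficiency 15%,
# Safety 20%, Consistency 10%, Frontier 20%.
WEIGHTS = {
    "success": 0.35,
    "efficiency": 0.15,
    "safety": 0.20,
    "consistency": 0.10,
    "frontier": 0.20,
}

def composite_score(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[d] * scores.get(d, 0.0) for d in WEIGHTS)

# A model that aces everything except Frontier tasks (score 0.0)
# tops out at 0.80, which is one way the Frontier tier discriminates.
capped = composite_score({
    "success": 1.0, "efficiency": 1.0, "safety": 1.0,
    "consistency": 1.0, "frontier": 0.0,
})
print(round(capped, 2))
```

This also makes the Computer Use caveat concrete: a model scoring zero on a heavily weighted dimension loses that entire slice of the composite, regardless of how well it does elsewhere.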
Maintenance & Community

The repository welcomes new task contributions via tasks/TASK_TEMPLATE.md. Issue tracking is available on GitHub. Specific details on active maintainers, community channels (e.g., Discord/Slack), or roadmaps are not detailed in the provided README.

Licensing & Compatibility

The project is released under the MIT license, generally permitting broad use, including commercial applications and linking with closed-source software, subject to the license terms.

Limitations & Caveats

Models lacking Computer Use capabilities will score zero on relevant tasks, creating a substantial performance disparity. The benchmark's most discriminating tiers (Frontier and Computer Use) highlight significant capability gaps for less advanced models. Some grading relies on LLM judges, necessitating human audit for full transparency.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 331 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux and Create React App), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

terminal-bench by harbor-framework

Benchmark for LLM agents in real terminal environments

  • Top 4.0% on SourcePulse
  • 2k stars
  • Created 1 year ago, updated 2 months ago