myclaw-bench  by LeoYeAI

Benchmark for AI agents on OpenClaw

Created 1 month ago
281 stars

Top 92.6% on SourcePulse

View on GitHub
Project Summary

MyClaw Bench addresses the inadequacy of AI agent benchmarks that prioritize format compliance over actual task completion. It provides a definitive evaluation suite for AI agents on OpenClaw, targeting engineers and researchers. The benefit is a clear understanding of which LLMs excel in real-world agent scenarios, focusing on outcome, reasoning, safety, and efficiency.

How It Works

This benchmark employs a multi-dimensional scoring system across 45 tasks spanning four difficulty tiers: Foundation, Reasoning, Mastery, and Frontier (including Computer Use). Its core innovation lies in semantic grading, which assesses task outcomes through proper parsing and semantic correctness rather than brittle regex matching. The approach prioritizes real-world applicability by evaluating reasoning, safety, efficiency, and resilience, moving beyond simple success rates.
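The contrast between semantic grading and regex matching can be sketched as follows. This is an illustrative example only, not the benchmark's actual grader: the function name, JSON output format, and expected-answer shape are all assumptions made for the sketch.

```python
import json

def semantic_grade(agent_output: str, expected: dict) -> bool:
    """Hypothetical semantic grader: parse the agent's output and compare
    values, rather than matching the raw string with a brittle regex."""
    try:
        parsed = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    # Key order, whitespace, and formatting no longer matter; only the
    # semantic content of the answer does.
    return all(parsed.get(k) == v for k, v in expected.items())

# Both outputs are semantically correct, though a regex pinned to one
# exact string would reject the second.
assert semantic_grade('{"answer": 42, "unit": "files"}', {"answer": 42})
assert semantic_grade('{ "unit": "files", "answer": 42 }', {"answer": 42})
assert not semantic_grade('{"answer": 41}', {"answer": 42})
```

The point of the sketch: a grader that parses outputs tolerates formatting variation that would break exact-match checks, which is what the summary means by assessing outcomes "through proper parsing and semantic correctness."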

Quick Start & Requirements

Installation involves cloning the repository. Execution is script-driven, with primary commands like ./scripts/run.sh --model <MODEL_ID> for all tasks or --tier <TIER_NAME> for specific difficulty levels. Requirements include Python 3.10+, the uv package manager, a running OpenClaw instance, and an API key for the tested model. Official links: Leaderboard (bench.myclaw.ai), OpenClaw (github.com/openclaw/openclaw).

Highlighted Details

  • Features 45 tasks across Foundation, Reasoning, Mastery, and Frontier tiers, designed to differentiate AI agent capabilities.
  • Evaluates agents across Outcome, Reasoning, Safety, Efficiency, Resilience, and Consistency dimensions.
  • Scoring is multi-faceted: Success (35%), Efficiency (15%), Safety (20%), Consistency (10%), and Frontier intelligence (20%).
  • Frontier tasks specifically target advanced capabilities like metacognition, inductive reasoning, and calibrated safety judgment.
  • Computer Use tasks necessitate real browser interaction, creating a significant performance gap for models lacking this functionality.
  • Benchmark data is derived from over 10,000 real agent sessions.

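The scoring weights above imply a simple weighted composite. The sketch below uses only the published percentages; the function name, the 0–1 score scale, and the assumption that dimensions combine as a weighted sum are illustrative, not the repo's actual API.

```python
# Published weights from the summary: Success 35%, Efficiency 15%,
# Safety 20%, Consistency 10%, Frontier 20%.
WEIGHTS = {
    "success": 0.35,
    "efficiency": 0.15,
    "safety": 0.20,
    "consistency": 0.10,
    "frontier": 0.20,
}

def composite_score(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[d] * scores.get(d, 0.0) for d in WEIGHTS)

# A model that aces everything except Frontier tasks (score 0.0)
# tops out at 0.80, which is one way the Frontier tier discriminates.
capped = composite_score({
    "success": 1.0, "efficiency": 1.0, "safety": 1.0,
    "consistency": 1.0, "frontier": 0.0,
})
print(round(capped, 2))
```

This also makes the Computer Use caveat concrete: a model scoring zero on a heavily weighted dimension loses that entire slice of the composite, regardless of how well it does elsewhere.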
Maintenance & Community

The repository welcomes new task contributions via tasks/TASK_TEMPLATE.md. Issue tracking is available on GitHub. Specific details on active maintainers, community channels (e.g., Discord/Slack), or roadmaps are not detailed in the provided README.

Licensing & Compatibility

The project is released under the MIT license, generally permitting broad use, including commercial applications and linking with closed-source software, subject to the license terms.

Limitations & Caveats

Models lacking Computer Use capabilities will score zero on relevant tasks, creating a substantial performance disparity. The benchmark's most discriminating tiers (Frontier and Computer Use) highlight significant capability gaps for less advanced models. Some grading relies on LLM judges, necessitating human audit for full transparency.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 331 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux and Create React App), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

terminal-bench by harbor-framework

Benchmark for LLM agents in real terminal environments

  • Top 4.0% on SourcePulse
  • 2k stars
  • Created 1 year ago, updated 2 months ago