LeoYeAI: Benchmark for AI agents on OpenClaw
Top 92.6% on SourcePulse
Summary
MyClaw Bench addresses the inadequacy of AI agent benchmarks that prioritize format compliance over actual task completion. It provides a definitive evaluation suite for AI agents on OpenClaw, targeting engineers and researchers, and gives a clear picture of which LLMs excel in real-world agent scenarios, scored on outcome, reasoning, safety, and efficiency.
How It Works
This benchmark employs a multi-dimensional scoring system across 45 tasks spanning four difficulty tiers: Foundation, Reasoning, Mastery, and Frontier (including Computer Use). Its core innovation lies in semantic grading, which assesses task outcomes through proper parsing and semantic correctness rather than brittle regex matching. The approach prioritizes real-world applicability by evaluating reasoning, safety, efficiency, and resilience, moving beyond simple success rates.
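As a rough illustration of the idea (this is not the benchmark's actual grader; the function names and JSON schema below are assumptions), semantic grading parses the agent's output and compares meaning, where a regex check would reject a harmless formatting difference:

```python
# Hypothetical sketch: semantic grading vs. brittle regex matching.
import json
import re

def regex_grade(raw_output: str) -> bool:
    # Brittle: only accepts one exact surface form of the answer.
    return re.search(r'"status":\s*"done"', raw_output) is not None

def semantic_grade(raw_output: str, expected: dict) -> bool:
    # Parse the agent's output and compare meaning, not surface form.
    try:
        result = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # Tolerate extra keys, different field order, and equivalent wording/casing.
    return (
        str(result.get("status", "")).lower() in {"done", "complete", "completed"}
        and result.get("files_created") == expected["files_created"]
    )

if __name__ == "__main__":
    output = '{"files_created": ["report.md"], "status": "COMPLETED", "notes": "ok"}'
    print(regex_grade(output))                                       # False: regex misses the equivalent answer
    print(semantic_grade(output, {"files_created": ["report.md"]}))  # True: outcome is semantically correct
```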
Quick Start & Requirements
Installation involves cloning the repository. Execution is script-driven, with primary commands like ./scripts/run.sh --model <MODEL_ID> for all tasks or --tier <TIER_NAME> for specific difficulty levels. Requirements include Python 3.10+, the uv package manager, a running OpenClaw instance, and an API key for the tested model. Official links: Leaderboard (bench.myclaw.ai), OpenClaw (github.com/openclaw/openclaw).
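A minimal quick-start sketch based on the commands above (the repository URL and directory are placeholders, not confirmed paths):

```bash
# Clone the benchmark repository (substitute the actual URL).
git clone <REPO_URL>
cd <REPO_DIR>

# Run the full suite against one model (requires Python 3.10+, uv,
# a running OpenClaw instance, and an API key for the model under test).
./scripts/run.sh --model <MODEL_ID>

# Or run a single difficulty tier.
./scripts/run.sh --tier <TIER_NAME>
```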
Maintenance & Community
The repository welcomes new task contributions via tasks/TASK_TEMPLATE.md, and issue tracking is available on GitHub. Details on active maintainers, community channels (e.g., Discord/Slack), and roadmaps are not provided in the README.
Licensing & Compatibility
The project is released under the MIT license, generally permitting broad use, including commercial applications and linking with closed-source software, subject to the license terms.
Limitations & Caveats
Models lacking Computer Use capabilities will score zero on relevant tasks, creating a substantial performance disparity. The benchmark's most discriminating tiers (Frontier and Computer Use) highlight significant capability gaps for less advanced models. Some grading relies on LLM judges, necessitating human audit for full transparency.
Related projects: groq, TheAgentCompany, harbor-framework