claw-eval by claw-eval

Evaluate LLM agents in real-world scenarios with a transparent benchmark

Created 1 month ago
391 stars

Top 73.4% on SourcePulse

Project Summary

Claw-Eval provides an end-to-end, transparent benchmark for evaluating Large Language Model (LLM) agents acting in real-world scenarios. It addresses the need for reproducible and robust evaluation by offering 139 human-verified tasks, 15 integrated services, and Docker-based sandboxing. This allows researchers and developers to rigorously assess agent performance with a focus on reliability and real-world complexity.

How It Works

The core methodology employs Docker sandboxes for isolated execution and a strict "Pass^3" metric: a model earns credit for a task only if it succeeds in all three independent trials, which eliminates "lucky runs" and rewards consistent performance. The system is designed for end-to-end reproducibility; when API instability interrupts a run, the evaluation is re-triggered until three complete trajectories have been generated.
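The Pass^3 rule described above can be sketched in a few lines of Python. The function names here are illustrative, not the project's actual API:

```python
def pass_cubed(trial_results):
    """Pass^3: a task counts as passed only if all three
    independent trials succeed (no credit for lucky runs)."""
    if len(trial_results) != 3:
        raise ValueError("Pass^3 requires exactly three trials")
    return all(trial_results)


def pass_cubed_rate(per_task_trials):
    """Fraction of tasks whose three trials all succeeded."""
    passed = sum(pass_cubed(trials) for trials in per_task_trials)
    return passed / len(per_task_trials)
```

Because a single failed trial zeroes out the task, Pass^3 is a strictly harsher score than the common pass@1 average over trials.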

Quick Start & Requirements

  • Installation: Use uv for dependency management (pip install uv, uv venv --python 3.11, source .venv/bin/activate).
  • Prerequisites: API keys for OPENROUTER_API_KEY and SERP_DEV_KEY (for web search tasks) are required. Python 3.11 is recommended.
  • Execution: Run sandbox tests via scripts/test_sandbox.sh or batch evaluations using claw-eval batch --config <config_file> --sandbox --trials 3 --parallel <N>.
  • Documentation: Leaderboard and task details are available at https://claw-eval.github.io.
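Putting the steps above together, a typical session might look like the following sketch. The API key values are placeholders, and the config filename and parallelism value are illustrative assumptions, not defaults shipped with the project:

```shell
# Set up a Python 3.11 environment with uv
pip install uv
uv venv --python 3.11
source .venv/bin/activate

# Required API keys (placeholders -- substitute your own)
export OPENROUTER_API_KEY="sk-or-..."
export SERP_DEV_KEY="..."   # needed only for web-search tasks

# Smoke-test the Docker sandbox, then run a batch evaluation
scripts/test_sandbox.sh
claw-eval batch --config my_config.yaml --sandbox --trials 3 --parallel 4
```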

Highlighted Details

  • Supports 139 tasks and 23 models, with a live leaderboard.
  • Introduced Pass^3 metric for enhanced reliability across three independent trials.
  • Version 1.1.0 adds 35 multimodal agentic tasks, expanding capabilities to perception, reasoning, creation, and delivery.
  • Commitment to end-to-end reproducibility, with ongoing codebase audits.

Maintenance & Community

  • Key contributors include Bowen Ye, Rang Li, Qibin Yang, Zhihui Xie, and Lei Li (Project Lead).
  • No explicit community channels (Discord/Slack) or social media links are provided in the README. A roadmap is outlined.

Licensing & Compatibility

  • Licensed under the MIT license, permitting broad use, including commercial applications.

Limitations & Caveats

  • The codebase is currently undergoing an audit to ensure full reproducibility of benchmark results.
  • Future roadmap items indicate potential for enhanced scoring logic, state verification, and sandbox isolation, suggesting these areas may be less mature in the current version.
  • The project appears relatively new, with core releases documented from March 2026.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 21
  • Star History: 345 stars in the last 30 days
