claw-eval: Evaluate LLM agents in real-world scenarios with a transparent benchmark
Top 73.4% on SourcePulse
Claw-Eval provides an end-to-end, transparent benchmark for evaluating Large Language Model (LLM) agents acting in real-world scenarios. It addresses the need for reproducible and robust evaluation by offering 139 human-verified tasks, 15 integrated services, and Docker-based sandboxing. This allows researchers and developers to rigorously assess agent performance with a focus on reliability and real-world complexity.
How It Works
The core methodology employs Docker sandboxes for isolated execution and a strict "Pass^3" metric, requiring a model to succeed in all three independent trials before a task counts as passed. This approach eliminates "lucky runs" and rewards consistent performance. The system is designed for end-to-end reproducibility, with mechanisms that handle API instability by re-triggering evaluations until three complete trajectories are generated.
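The strict all-or-nothing aggregation described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the retry-on-API-failure loop are assumptions, not code from the repository.

```python
def pass_cubed(trial_results):
    """Strict Pass^3: the task is credited only if all three
    independent trials succeeded (no 'lucky runs')."""
    assert len(trial_results) == 3, "Pass^3 requires exactly three trials"
    return all(trial_results)

def collect_trajectories(run_task, n_trials=3, max_attempts=10):
    """Re-trigger evaluations until n_trials complete trajectories exist.
    `run_task` is a hypothetical callable returning True on success,
    False on failure, or raising on transient API errors."""
    results = []
    attempts = 0
    while len(results) < n_trials and attempts < max_attempts:
        attempts += 1
        try:
            results.append(run_task())
        except RuntimeError:
            continue  # API instability: retry instead of recording a trial
    return results
```

Under this scheme, a per-trial success rate of p yields an expected Pass^3 score of p**3, which is why the metric penalizes inconsistent agents so heavily.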
Quick Start & Requirements
Dependencies are managed with uv (pip install uv, uv venv --python 3.11, source .venv/bin/activate), and Python 3.11 is recommended. The OPENROUTER_API_KEY and SERP_DEV_KEY (for web-search tasks) environment variables are required. Run scripts/test_sandbox.sh to verify the sandbox setup, or launch batch evaluations with claw-eval batch --config <config_file> --sandbox --trials 3 --parallel <N>.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats