Evaluate agentic models across diverse real-world tool environments
Summary
MCPMark is a comprehensive benchmark suite designed to stress-test and evaluate agentic models within real-world Model Context Protocol (MCP) tool environments. It targets researchers and engineers, providing a reproducible, extensible framework for assessing model capabilities across services like Notion, GitHub, Filesystem, Postgres, and Playwright. The benchmark offers automated verification, unified metrics, and aggregated reports, simplifying the evaluation of complex agent behaviors.
How It Works
The system evaluates agentic models by executing ready-to-run tasks across integrated MCP services. It employs isolated sandboxes to prevent data pollution and features auto-resume for failed tasks, ensuring reliability. Each task includes strict, automated verification for objective assessment. Core advantages include reproducible results, unified metrics (e.g., pass@k, avg@k), and aggregated reports, facilitating robust model performance analysis and comparison.
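MCPMark's own aggregation code is not reproduced here; as a rough illustration of how such metrics can be computed, the Python sketch below applies the standard unbiased pass@k estimator and a simple avg@k (mean success rate over k runs) to hypothetical per-task pass/fail outcomes. Task names, run counts, and the function names are placeholders, not the repository's API.

```python
from math import comb
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k runs drawn
    from n total runs (c of them successful) passes the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def aggregate(runs_per_task: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Aggregate per-task run outcomes into benchmark-level metrics.
    `runs_per_task` maps a task id to the pass/fail result of each run."""
    pass_k = mean(pass_at_k(len(runs), sum(runs), k) for runs in runs_per_task.values())
    avg_k = mean(mean(runs[:k]) for runs in runs_per_task.values())
    return {"pass@k": pass_k, "avg@k": avg_k}

# Hypothetical results: two tasks, four independent runs each.
results = {
    "notion/task-01": [True, False, True, True],
    "github/task-02": [False, False, True, False],
}
print(aggregate(results, k=2))
```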
Quick Start & Requirements
Install from source (pip install -e .) or via Docker (./build-docker.sh).
Requires API keys (e.g., OPENAI_API_KEY, GITHUB_TOKENS), Playwright browsers (playwright install), and models configured via LiteLLM (a quick connectivity check is sketched below).
Resources: Docs (https://mcpmark.ai/docs), Website (https://mcpmark.ai), Hugging Face Trajectory Logs (https://huggingface.co/datasets/Jakumetsu/mcpmark-trajectory-log).
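Since models are routed through LiteLLM, one way to confirm that credentials and a model name are wired up before launching a full benchmark run is a one-off completion call. This is a minimal sketch, assuming Python 3.9+, the litellm package, and a placeholder model name; it is not MCPMark's own tooling.

```python
import os
import litellm

# Keys named in the setup notes above; extend as needed for other services.
required = ["OPENAI_API_KEY", "GITHUB_TOKENS"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")

# Smoke-test LiteLLM routing with a trivial prompt; the model name here is a
# placeholder, not an MCPMark default.
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Reply with 'ok'."}],
)
print(response.choices[0].message.content)
```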
Highlighted Details
Maintenance & Community
Community support is available on Discord (https://discord.gg/HrKkJAxDnA).
The project maintains a website (https://mcpmark.ai) and an arXiv preprint (https://arxiv.org/abs/2509.24002).
Contribution guidelines are in docs/contributing/make-contribution.md.
Licensing & Compatibility
Limitations & Caveats