mcpmark by eval-sys

Evaluate agentic models across diverse real-world tool environments

Created 3 months ago
255 stars

Top 98.8% on SourcePulse

Summary

MCPMark is a comprehensive benchmark suite designed to stress-test and evaluate agentic models within real-world Model Context Protocol (MCP) tool environments. It targets researchers and engineers, providing a reproducible, extensible framework for assessing model capabilities across services like Notion, GitHub, Filesystem, Postgres, and Playwright. The benchmark offers automated verification, unified metrics, and aggregated reports, simplifying the evaluation of complex agent behaviors.

How It Works

The system evaluates agentic models by executing ready-to-run tasks across integrated MCP services. It employs isolated sandboxes to prevent data pollution and features auto-resume for failed tasks, ensuring reliability. Each task includes strict, automated verification for objective assessment. Core advantages include reproducible results, unified metrics (e.g., pass@k, avg@k), and aggregated reports, facilitating robust model performance analysis and comparison.
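
To make the unified metrics concrete: avg@k is simply the mean success rate over the k runs of a task, while pass@k is commonly computed with the standard unbiased estimator (the probability that at least one of k runs sampled from n recorded runs passes). The Python below is an illustrative sketch of those definitions, not MCPMark's own implementation:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k runs
        sampled from n recorded runs (c of them successful) passes."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def avg_at_k(results: list[bool]) -> float:
        """avg@k: mean success rate across the recorded runs of a task."""
        return sum(results) / len(results)

    # Illustrative only: four runs of one task, two of which passed verification.
    runs = [True, False, True, False]
    print(avg_at_k(runs))            # 0.5
    print(pass_at_k(n=4, c=2, k=2))  # ~0.83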

Quick Start & Requirements

  • Primary Install: Local (pip install -e .) or Docker (./build-docker.sh).
  • Prerequisites: Environment variables for service credentials (e.g., OPENAI_API_KEY, GITHUB_TOKENS), Playwright browsers (playwright install), and models configured via LiteLLM (see the credential sanity-check sketch after this list).
  • Setup: the quickstart takes an estimated 5 minutes.
  • Links: Official Docs (https://mcpmark.ai/docs), Website (https://mcpmark.ai), Hugging Face Trajectory Logs (https://huggingface.co/datasets/Jakumetsu/mcpmark-trajectory-log).
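
Models are configured via LiteLLM, which reads provider credentials from the environment. As a quick sanity check that your key is visible before running tasks (an illustrative snippet, not part of MCPMark; the model name is a placeholder):

    import os
    import litellm

    # LiteLLM reads provider credentials from the environment,
    # e.g. OPENAI_API_KEY for OpenAI-hosted models.
    assert os.environ.get("OPENAI_API_KEY"), "export OPENAI_API_KEY first"

    # Placeholder model name; substitute the model you plan to evaluate.
    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )
    print(response.choices[0].message.content)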

Highlighted Details

  • Supports evaluation across Notion, GitHub, Filesystem, Postgres, and Playwright MCP services.
  • Features one-command tasks with strict, automated verification and isolated execution environments.
  • Includes auto-resume for failed tasks and unified metrics for single/multi-run evaluations (pass@k, avg@k).
  • Flexible deployment options: local (macOS/Linux validated) and Docker.

Maintenance & Community

  • Community support is available via Discord (https://discord.gg/HrKkJAxDnA).
  • Project details and research findings are linked via its website (https://mcpmark.ai) and an arXiv preprint (https://arxiv.org/abs/2509.24002).
  • Contribution guidelines are provided in docs/contributing/make-contribution.md.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • This license is permissive, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Auto-resume relies on matching specific error-string patterns; new or unhandled errors may require manual intervention or an upstream contribution (illustrated in the sketch following this list).
  • Final aggregated reports only include models with complete results across all tasks and runs.
  • The arXiv paper's 2025 date suggests the project may be in active development or early research stages.
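
The first caveat describes pattern-based retry: auto-resume of this kind matches an error message against a fixed set of known transient-error patterns and retries only on a match, leaving anything unrecognized for manual handling. The pattern list and function below are hypothetical illustrations, not MCPMark's own code:

    import re

    # Hypothetical transient-error patterns; MCPMark's real list lives in its source.
    RESUMABLE_PATTERNS = [
        re.compile(r"rate limit", re.IGNORECASE),
        re.compile(r"connection reset", re.IGNORECASE),
        re.compile(r"timed? ?out", re.IGNORECASE),
    ]

    def should_auto_resume(error_message: str) -> bool:
        """Retry only when the error matches a known transient pattern;
        anything unrecognized is left for manual inspection."""
        return any(p.search(error_message) for p in RESUMABLE_PATTERNS)

    print(should_auto_resume("HTTP 429: rate limit exceeded"))  # True
    print(should_auto_resume("schema validation failed"))       # False -> manual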

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 14
  • Issues (30d): 2
  • Star History: 111 stars in the last 30 days

Explore Similar Projects

Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Lewis Tunstall (Research Engineer at Hugging Face), and 5 more.

openbench by groq

2.8% · 590 stars
Provider-agnostic LLM evaluation infrastructure
Created 2 months ago · Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 2 more.

tau-bench by sierra-research

1.8% · 881 stars
Benchmark for tool-agent-user interaction research
Created 1 year ago · Updated 1 month ago