mcp-bench by Accenture

Evaluate LLM agents' tool-use capabilities

Created 2 months ago
366 stars

Top 76.9% on SourcePulse

Project Summary

MCP-Bench is an evaluation framework that benchmarks the tool-use capabilities of Large Language Model (LLM) agents on complex, real-world tasks via the Model Context Protocol (MCP). It provides an end-to-end pipeline for assessing how effectively LLMs discover, select, and use tools, and is aimed at researchers and developers working on LLM agents.

How It Works

The framework employs the Model Context Protocol (MCP) to facilitate LLM interaction with a suite of 28 diverse real-world services, referred to as MCP servers. It orchestrates benchmark runs, evaluating LLMs on their ability to understand task requirements, select appropriate tools from available MCP servers, and execute them effectively. This approach allows for systematic measurement of tool-use proficiency across various complex scenarios.
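The discover, select, and execute steps described above can be illustrated with a toy sketch. It does not use the MCP SDK or any MCP-Bench code: the in-memory SERVERS registry, the "server:tool" identifier format, and the functions discover_tools and call_tool are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for MCP servers: server name -> {tool name -> callable}.
# Real MCP servers are separate services; plain functions keep the sketch runnable.
SERVERS: Dict[str, Dict[str, Callable[..., object]]] = {
    "calculator": {"add": lambda a, b: a + b},
    "text": {"upper": lambda s: s.upper()},
}


def discover_tools(servers: Dict[str, Dict[str, Callable[..., object]]]) -> List[str]:
    """List every available tool as a sorted 'server:tool' identifier."""
    return sorted(
        f"{server}:{tool}" for server, tools in servers.items() for tool in tools
    )


def call_tool(
    servers: Dict[str, Dict[str, Callable[..., object]]], tool_id: str, **kwargs: object
) -> object:
    """Look up the tool named by a 'server:tool' identifier and run it."""
    server, tool = tool_id.split(":", 1)
    return servers[server][tool](**kwargs)


available = discover_tools(SERVERS)                         # discovery
chosen = "calculator:add"                                   # selection (an agent would decide this)
result = call_tool(SERVERS, chosen, a=2, b=3)               # execution
```

In a real benchmark run the selection step is made by the LLM under evaluation; here it is hard-coded so that the flow stays visible.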

Quick Start & Requirements

Installation involves cloning the repository (https://github.com/accenture/mcp-bench.git), setting up a Conda environment with Python 3.10, and running an installation script for the MCP servers. API keys are required for multiple services (OpenRouter, Azure OpenAI, NPS, NASA, Hugging Face, Google Maps, NCI). Some API key registration portals may require a US IP address. Model providers can be browsed at https://openrouter.ai/models.
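Because setup depends on several API keys, a small pre-flight check can help catch missing ones before a run. The sketch below is an assumption-laden illustration: the environment variable names are hypothetical and are not taken from the MCP-Bench README, so substitute whatever names the project actually expects.

```python
from typing import List, Mapping, Sequence

# Hypothetical variable names, one per service listed above.
# The names MCP-Bench actually reads are not stated in this summary.
REQUIRED_KEYS = [
    "OPENROUTER_API_KEY",
    "AZURE_OPENAI_API_KEY",
    "NPS_API_KEY",
    "NASA_API_KEY",
    "HF_TOKEN",
    "GOOGLE_MAPS_API_KEY",
    "NCI_API_KEY",
]


def missing_keys(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the required names that are unset or blank in env, in order."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with a plain dict; pass os.environ to check the real environment.
example_env = {"OPENROUTER_API_KEY": "sk-example", "NASA_API_KEY": "  "}
still_missing = missing_keys(example_env, ["OPENROUTER_API_KEY", "NASA_API_KEY"])
```

A whitespace-only value is treated as missing, since a blank key would fail at request time anyway.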

Highlighted Details

  • Benchmarks LLM agents' tool-use capabilities across 28 diverse MCP servers, including Google Maps, Hugging Face, and NASA Data.
  • Features a leaderboard showcasing performance scores for leading LLMs like GPT-5, Gemini-2.5-Pro, and Claude-Sonnet-4.
  • Is extensible: new model providers can be added, such as integrating Azure models via OpenRouter.
  • Evaluates LLMs across multiple dimensions including schema understanding, task completion, tool usage, and planning effectiveness.
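As an illustration of multi-dimensional evaluation, the sketch below combines per-dimension scores into a single number. The unweighted mean, the 0-to-1 scale, and the dimension keys are assumptions made for this example; the summary does not say how MCP-Bench actually aggregates its scores.

```python
from statistics import mean
from typing import Mapping

# Dimension keys mirror the four dimensions named above (key spelling is hypothetical).
DIMENSIONS = (
    "schema_understanding",
    "task_completion",
    "tool_usage",
    "planning_effectiveness",
)


def overall_score(scores: Mapping[str, float]) -> float:
    """Unweighted mean over all dimensions; raises if any dimension is absent."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


# Made-up scores on an assumed 0-1 scale, for illustration only.
example = {
    "schema_understanding": 0.5,
    "task_completion": 1.0,
    "tool_usage": 0.5,
    "planning_effectiveness": 1.0,
}
combined = overall_score(example)
```

Requiring every dimension to be present avoids silently averaging over a partial result, which would make scores from different runs incomparable.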

Maintenance & Community

The project's primary documentation is the README. It does not describe maintenance status, community channels (such as Discord or Slack), or sponsorships. The citation lists several authors, indicating research contributions.

Licensing & Compatibility

The README content does not state which open-source license applies to the MCP-Bench repository.

Limitations & Caveats

Some API key registration portals may require a US IP address, as noted in Issue #10. Setup also requires configuring multiple external API keys, which adds complexity to the initial deployment.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 5
  • Star history: 35 stars in the last 30 days
