mcp-bench by Accenture

Evaluate LLM agents' tool-use capabilities

Created 3 weeks ago


314 stars

Top 85.9% on SourcePulse

View on GitHub
Project Summary

MCP-Bench is an evaluation framework for benchmarking Large Language Model (LLM) agents' tool-use capabilities in complex, real-world tasks via the Model Context Protocol (MCP). It offers an end-to-end pipeline for assessing how effectively LLMs discover, select, and utilize tools, providing valuable insights for researchers and developers in the LLM agent space.

How It Works

The framework employs the Model Context Protocol (MCP) to facilitate LLM interaction with a suite of 28 diverse real-world services, referred to as MCP servers. It orchestrates benchmark runs, evaluating LLMs on their ability to understand task requirements, select appropriate tools from available MCP servers, and execute them effectively. This approach allows for systematic measurement of tool-use proficiency across various complex scenarios.

Quick Start & Requirements

Installation involves cloning the repository (https://github.com/accenture/mcp-bench.git), setting up a Conda environment with Python 3.10, and running an installation script for the MCP servers. You will also need API keys for several services (OpenRouter, Azure OpenAI, NPS, NASA, Hugging Face, Google Maps, NCI); some API key registration portals may require a US IP address. Available model providers can be browsed at https://openrouter.ai/models.
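The steps above can be sketched as a short setup script. The Conda environment name and the install-script filename are illustrative assumptions, not confirmed by the repository; only the clone URL and Python version come from the text above:

```shell
# Clone the repository (URL from the summary above)
git clone https://github.com/accenture/mcp-bench.git
cd mcp-bench

# Create and activate a Python 3.10 Conda environment
# (the environment name "mcp-bench" is illustrative)
conda create -n mcp-bench python=3.10 -y
conda activate mcp-bench

# Run the MCP-server installation script; the filename below is a
# placeholder -- check the repository for the actual script name
bash ./install.sh
```

Check the repository README for the exact script name and any additional setup flags before running.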

Highlighted Details

  • Benchmarks LLM agents' tool-use capabilities across 28 diverse MCP servers, including Google Maps, Hugging Face, and NASA Data.
  • Features a leaderboard showcasing performance scores for leading LLMs like GPT-5, Gemini-2.5-Pro, and Claude-Sonnet-4.
  • Supports extensibility by allowing the addition of new model providers, such as integrating Azure models via OpenRouter.
  • Evaluates LLMs across multiple dimensions including schema understanding, task completion, tool usage, and planning effectiveness.

Maintenance & Community

The project's primary documentation is its README. Details on active maintenance, community channels (such as Discord or Slack), or sponsorships are not given in the provided text. The citation lists several authors, suggesting an accompanying research publication.

Licensing & Compatibility

The specific open-source license for the MCP-Bench repository is not explicitly stated in the provided README content.

Limitations & Caveats

Acquiring the necessary API keys can be a hurdle: some registration portals may require a US IP address, as noted in Issue #10. The setup also requires configuring multiple external API keys, which adds complexity to the initial deployment.
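API keys like these are typically supplied via environment variables. The variable names below are illustrative placeholders only; the repository's actual configuration mechanism and variable names are not stated in the text, so consult the README:

```shell
# Illustrative environment variables for the required API keys.
# Actual names and the configuration method (env vars vs. a config
# file) may differ -- these are placeholders, not the repo's API.
export OPENROUTER_API_KEY="..."
export AZURE_OPENAI_API_KEY="..."
export NASA_API_KEY="..."
export HF_TOKEN="..."
export GOOGLE_MAPS_API_KEY="..."
```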

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 8

Star History

  • 314 stars in the last 22 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Magnus Müller (Cofounder of Browser Use), and 83 more.

langchain by langchain-ai: Framework for building LLM-powered applications. 116k stars. Created 2 years ago; updated 1 day ago.