MCPBench: Benchmarking Math MCP Tool Selection

MCPBench is a lightweight framework to benchmark AI agents’ tool-selection accuracy and end-to-end performance on a local MCP math server. It leverages:

  • A simple MCP math server (math_mcp_server.py) exposing basic arithmetic tools.
  • A LangGraph-based ReAct-style agent (langgraph_local_mcp_client.py) that can use Amazon Bedrock or any chat LLM.
  • A benchmark mode that runs a suite of math tasks, logs detailed metrics, and outputs a CSV summary.

🛠 Prerequisites

  1. Install UV (Astral’s Python package manager) and create a Python virtual environment:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$PATH"
    uv venv && source .venv/bin/activate && uv pip sync pyproject.toml
    export UV_PROJECT_ENVIRONMENT=.venv
    uv add zmq
    python -m ipykernel install --user --name=.venv --display-name="Python (uv env)"
    
  2. Set up AWS credentials (if you plan to use Bedrock as your LLM):

    export AWS_PROFILE=your-aws-profile
    export AWS_REGION=us-west-2
    export BEDROCK_MODEL_ID=us.amazon.nova-lite-v1:0
    

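How these variables are consumed depends on the client; as a rough illustration (this is an assumption, not the project’s actual code), a Bedrock-backed chat model could be built from them with the langchain-aws package:

    # Rough illustration only: assumes the client builds its chat model with langchain-aws;
    # the real construction lives in langgraph_local_mcp_client.py and may differ.
    import os

    from langchain_aws import ChatBedrock

    llm = ChatBedrock(
        model_id=os.environ.get("BEDROCK_MODEL_ID", "us.amazon.nova-lite-v1:0"),
        region_name=os.environ.get("AWS_REGION", "us-west-2"),
        credentials_profile_name=os.environ.get("AWS_PROFILE"),
    )
    print(llm.invoke("What is 7 + 8?").content)
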
📁 Project Structure

mcp-bench/
├── agents/
│   ├── langgraph_local_mcp_client.py   # ReAct-style client + benchmark driver
│   ├── math_results.csv                # (auto-generated) per-task benchmark output
│   └── benchmark_summary.txt           # (auto-generated) aggregated metrics summary
├── eval_functions/
│   └── benchmark_utils.py              # run_single() + write_summary() helpers
├── servers/
│   └── math_mcp_server.py              # MCP math tool server
├── math_env/
│   └── tasks.json                      # Benchmark task definitions (id, question, expected, tool)
├── few_shot_data/                      # Optional few-shot examples for in-prompt priming
├── globals.py                          # Paths & constants (e.g. server FPATHs, FEW_SHOT_FPATH)
├── pyproject.toml                      # Project dependencies
├── uv.lock                             # UV environment lockfile
└── README.md                           # This file: setup, usage, CSV schema, extension notes

🚀 Running the MCP Server

  1. Open a terminal and navigate to the servers folder:

    cd servers

  2. Launch the math MCP server:

    python math_mcp_server.py

The server will expose all @mcp.tool() functions (add, subtract, multiply, divide, etc.) over stdio.
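
As a point of reference, a minimal server of this kind can be written with the MCP Python SDK’s FastMCP helper. The sketch below is illustrative only; the actual math_mcp_server.py may differ:

    # Illustrative sketch of a stdio MCP math server using the MCP Python SDK's FastMCP;
    # tool names and bodies are examples, not the exact contents of math_mcp_server.py.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("math")

    @mcp.tool()
    def add(a: float, b: float) -> float:
        """Add two numbers."""
        return a + b

    @mcp.tool()
    def divide(a: float, b: float) -> float:
        """Divide a by b."""
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b

    if __name__ == "__main__":
        # Serve all registered tools over stdio so the client can spawn this process.
        mcp.run(transport="stdio")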

🤖 Interactive Client Mode

  1. In a new terminal, activate your UV venv and navigate to agents:

    cd agents

  2. Run the client in REPL mode:

    uv run langgraph_local_mcp_client.py

  3. Ask questions interactively:

    >>> What is 7 + 8?
    15
    >>> Calculate 6 * 9
    54
    >>> quit
    Goodbye!
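
Under the hood, the client spawns the server process and talks to it over stdio through an MCP client session. A stripped-down sketch using the MCP Python SDK (not the actual langgraph_local_mcp_client.py) looks roughly like this:

    # Stripped-down sketch of calling the math server over stdio with the MCP Python SDK.
    # This is an illustration, not the project's client code.
    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        server = StdioServerParameters(command="python", args=["servers/math_mcp_server.py"])
        async with stdio_client(server) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                tools = await session.list_tools()                          # ListToolsRequest
                print("Available tools:", [t.name for t in tools.tools])
                result = await session.call_tool("add", {"a": 7, "b": 8})   # CallToolRequest
                print("add(7, 8) ->", result.content)

    if __name__ == "__main__":
        asyncio.run(main())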

📊 Benchmark Mode

To execute the full task suite and generate a CSV report:

uv run langgraph_local_mcp_client.py \
  --benchmark \
  --output math_results.csv \
  --model-id $BEDROCK_MODEL_ID \
  --recursions 30

  • --benchmark → run in batch mode
  • --output → path to write the CSV
  • --model-id → specify any supported chat LLM model
  • --recursions → adjust the agent’s graph recursion limit
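
Internally, the driver iterates over the tasks in math_env/tasks.json and records one row per task using the helpers in eval_functions/benchmark_utils.py. The sketch below is only a guess at their shape; the field names and signatures are assumptions:

    # Hypothetical shape of the benchmark loop; the real helpers live in
    # eval_functions/benchmark_utils.py and their names/signatures may differ.
    import json
    import time

    def run_single(agent, task: dict) -> dict:
        """Run one task, time it, and compare the chosen tool and answer to ground truth."""
        start = time.time()
        response = agent.invoke(task["question"])       # assumed agent interface
        latency = time.time() - start
        used_tool = response.get("used_tool")           # assumed response fields
        return {
            "id": task["id"],
            "question": task["question"],
            "used_tool": used_tool,
            "tool_ground_truth": task["tool"],
            "tool_selection_accuracy": used_tool == task["tool"],
            "result": response.get("result"),
            "expected": task["expected"],
            "latency": latency,
        }

    def run_benchmark(agent, tasks_path: str = "math_env/tasks.json") -> list[dict]:
        """Run every task in tasks.json and return one result row per task."""
        with open(tasks_path) as f:
            tasks = json.load(f)
        return [run_single(agent, task) for task in tasks]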

Example results

Below is a sample summary of the tool-calling metrics collected with Amazon Nova Lite:

MCPBench Summary
----------------
Total Tasks              : 22
Avg Latency S            : 4.226772727272728
Tool Selection Accuracy  : 0.7727272727272727
Answer Accuracy          : 0.7727272727272727

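These numbers are simple aggregates over the per-task CSV. If you want to recompute them yourself, something along the following lines (an independent sketch, not the project’s write_summary()) works on agents/math_results.csv:

    # Recompute the aggregate metrics directly from the per-task CSV (an independent
    # sketch; assumes booleans are written as the strings "True"/"False", as in the
    # raw results shown below).
    import csv

    with open("agents/math_results.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    total = len(rows)
    avg_latency = sum(float(r["latency"]) for r in rows) / total
    tool_acc = sum(r["tool_selection_accuracy"] == "True" for r in rows) / total
    answer_acc = sum(r["correct_ans"] == "True" for r in rows) / total

    print(f"Total Tasks             : {total}")
    print(f"Avg Latency S           : {avg_latency}")
    print(f"Tool Selection Accuracy : {tool_acc}")
    print(f"Answer Accuracy         : {answer_acc}")
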
Raw results:

| id | question | used_tool | tool_ground_truth | tool_selection_accuracy | args | result | expected | correct_ans | latency | input_tokens | output_tokens | total_tokens | error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | What is 12 / 5? | divide | divide | True | {'a': 12, 'b': 5} | 2.4 | 2.4 | True | 4.009 | | | | |
| 2 | Compute 7 * 8 | multiply | multiply | True | {'a': 7, 'b': 8} | 56 | 56 | True | 2.492 | | | | |
| 3 | Calculate 15 + 27 | add | add | True | {'a': 15, 'b': 27} | 42 | 42 | True | 2.401 | | | | |
| 4 | What is 100 - 35? | subtract | subtract | True | {'a': 100, 'b': 35} | 65 | 65 | True | 2.323 | | | | |
| 22 | Calculate the volume of a cube with side length 4 | volume | volume | True | {'shape': 'cube', 'kwargs': 'side=4'} | Error: ToolException('Error executing tool volume: Cube requires side length') Please fix your mistakes. | 64 | False | 15.960 | | | | |

Detailed logging

When benchmark mode is on, detailed per-task traces are logged:

Processing request of type ListToolsRequest
▶ Task 1: What is 12 / 5?
Processing request of type CallToolRequest
--> Latency: 3.168s
--> Tool call: divide({'a': 12, 'b': 5})
--> Raw tool result: 2.4
--> Tokens: input=None, output=None, total=None
--> Expected: 2.4 → ✅ PASS

Once the run completes, open agents/math_results.csv to review the per-task results; its columns are described below.

📝 CSV Output Columns

| Column | Description |
| --- | --- |
| id | Task identifier |
| question | The natural-language math problem |
| used_tool | Name of the tool the agent invoked |
| tool_ground_truth | The correct tool that should have been called |
| tool_selection_accuracy | true if used_tool == tool_ground_truth, else false |
| args | Dictionary of arguments passed to the tool |
| result | Raw output returned by the tool |
| expected | Gold-standard answer |
| correct_ans | true if result == expected (with float tolerance) |
| latency | Round-trip time in seconds |
| input_tokens | Number of tokens in the prompt sent to the LLM |
| output_tokens | Number of tokens in the LLM’s response |
| total_tokens | Sum of input and output tokens |
| error | Any error message if a call failed |
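
For the correct_ans column, “float tolerance” means that, say, 12 / 5 returning 2.4000000000000004 still counts as matching 2.4. A comparison along these lines would do it (illustrative; the project’s exact tolerance may differ):

    # Illustrative float-tolerant answer check; the project's exact tolerance may differ.
    import math

    def answers_match(result, expected, rel_tol: float = 1e-6) -> bool:
        """Return True if result equals expected, allowing a small float tolerance."""
        try:
            return math.isclose(float(result), float(expected), rel_tol=rel_tol)
        except (TypeError, ValueError):
            # Non-numeric results (e.g. error strings) fall back to exact string comparison.
            return str(result) == str(expected)

    assert answers_match(2.4000000000000004, 2.4)
    assert not answers_match("Error: ToolException(...)", 64)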

🔧 Extending MCPBench

  • Add new tasks: edit math_env/tasks.json (each entry needs id, question, expected, tool); see the sketch after this list.
  • Few-shot data: place JSON examples in few_shot_data/ and update FEW_SHOT_FPATH in globals.py to point to them.
  • Custom tools: in servers/math_mcp_server.py, decorate new functions with @mcp.tool() and restart the server.
  • Swap LLMs: pass any --model-id supported by your provider (e.g. gpt-4o, claude-3-7-sonnet, mistral-large-2407).
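
As a concrete example of the task schema (the keys come from the list above; the values are made up), a new task could be appended programmatically, assuming tasks.json is a flat JSON list:

    # Append a new benchmark task (example values are made up; assumes tasks.json is a
    # flat JSON list of task objects with the keys id, question, expected, tool).
    import json

    new_task = {
        "id": 23,
        "question": "What is 144 / 12?",
        "expected": 12,
        "tool": "divide",
    }

    with open("math_env/tasks.json") as f:
        tasks = json.load(f)

    tasks.append(new_task)

    with open("math_env/tasks.json", "w") as f:
        json.dump(tasks, f, indent=2)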

📄 License

This project is released under the MIT License.
