MCPBench: Benchmarking Math MCP Tool Selection

MCPBench is a lightweight framework to benchmark AI agents’ tool-selection accuracy and end-to-end performance on a local MCP math server. It leverages:

  • A simple MCP math server (math_mcp_server.py) exposing basic arithmetic tools.
  • A LangGraph-based ReAct-style agent (langgraph_local_mcp_client.py) backed by Amazon Bedrock or any chat LLM.
  • A benchmark mode that runs a suite of math tasks, logs detailed metrics, and outputs a CSV summary.

🛠 Prerequisites

  1. Install uv (Astral’s Python package and project manager) and create a Python virtual environment:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$PATH"
    uv venv && source .venv/bin/activate && uv pip sync pyproject.toml
    export UV_PROJECT_ENVIRONMENT=.venv
    uv add zmq
    python -m ipykernel install --user --name=.venv --display-name="Python (uv env)"
    
  2. Set up AWS credentials (if you plan to use Bedrock as your LLM):

    export AWS_PROFILE=your-aws-profile
    export AWS_REGION=us-west-2
    export BEDROCK_MODEL_ID=us.amazon.nova-lite-v1:0
    

πŸ“ Project Structure

mcp-bench/
├── agents/
│   ├── langgraph_local_mcp_client.py   # ReAct-style client + benchmark driver
│   ├── math_results.csv                # (auto-generated) per-task benchmark output
│   └── benchmark_summary.txt           # (auto-generated) aggregated metrics summary
├── eval_functions/
│   └── benchmark_utils.py              # run_single() + write_summary() helpers
├── servers/
│   └── math_mcp_server.py              # MCP math tool server
├── math_env/
│   └── tasks.json                      # Benchmark task definitions (id, question, expected, tool)
├── few_shot_data/                      # Optional few-shot examples for in-prompt priming
├── globals.py                          # Paths & constants (e.g. server FPATHs, FEW_SHOT_FPATH)
├── pyproject.toml                      # Project dependencies
├── uv.lock                             # UV environment lockfile
└── README.md                           # This file: setup, usage, CSV schema, extension notes

🚀 Running the MCP Server

  1. Open a terminal and navigate to the servers folder:

    cd servers

  2. Launch the math MCP server:

    python math_mcp_server.py

The server will expose all @mcp.tool() functions (add, subtract, multiply, divide, etc.) over stdio.
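
For reference, a minimal server of this shape can be written with the official MCP Python SDK’s FastMCP helper. The tool bodies below are a sketch, not the repository’s exact implementation:

    # Minimal sketch of an MCP math server (assumes the official `mcp` Python SDK).
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("math")

    @mcp.tool()
    def add(a: float, b: float) -> float:
        """Add two numbers."""
        return a + b

    @mcp.tool()
    def divide(a: float, b: float) -> float:
        """Divide a by b."""
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b

    if __name__ == "__main__":
        # Serve the tools over stdio so a client can spawn this as a subprocess.
        mcp.run(transport="stdio")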

🤖 Interactive Client Mode

  1. In a new terminal, activate your UV venv and navigate to agents:

    cd agents

  2. Run the client in REPL mode:

    uv run langgraph_local_mcp_client.py

  3. Ask questions interactively:

    >>> What is 7 + 8?
    15
    >>> Calculate 6 * 9
    54
    >>> quit
    Goodbye!
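
Under the hood, a ReAct-style client like this typically spawns the stdio server, discovers its tools, and hands them to a LangGraph agent. The sketch below uses langchain-mcp-adapters and langchain-aws; the exact wiring in langgraph_local_mcp_client.py may differ, and the server path here is an assumption:

    # Hypothetical sketch of the client wiring (not the repo's exact code).
    import asyncio

    from langchain_aws import ChatBedrockConverse
    from langchain_mcp_adapters.client import MultiServerMCPClient
    from langgraph.prebuilt import create_react_agent

    async def main() -> None:
        # Spawn the math server as a stdio subprocess and discover its tools.
        client = MultiServerMCPClient(
            {
                "math": {
                    "command": "python",
                    "args": ["../servers/math_mcp_server.py"],
                    "transport": "stdio",
                }
            }
        )
        tools = await client.get_tools()

        # Bind the discovered tools to a Bedrock-backed ReAct agent.
        model = ChatBedrockConverse(model="us.amazon.nova-lite-v1:0")
        agent = create_react_agent(model, tools)

        result = await agent.ainvoke({"messages": [("user", "What is 7 + 8?")]})
        print(result["messages"][-1].content)

    asyncio.run(main())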

📊 Benchmark Mode

To execute the full task suite and generate a CSV report:

    uv run langgraph_local_mcp_client.py \
      --benchmark \
      --output math_results.csv \
      --model-id $BEDROCK_MODEL_ID \
      --recursions 30

  • --benchmark → run in batch mode
  • --output → path to write the CSV
  • --model-id → specify any supported chat LLM model
  • --recursions → adjust the agent’s graph recursion limit

Example results

Below is a summary of the tool-calling metrics obtained with Amazon Nova Lite:

MCPBench Summary
----------------
Total Tasks              : 22
Avg Latency S            : 4.226772727272728
Tool Selection Accuracy  : 0.7727272727272727
Answer Accuracy          : 0.7727272727272727
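
Each summary line is a plain aggregate over the per-task rows; for instance, a tool selection accuracy of 0.7727 is 17 correct selections out of 22 tasks. Assuming the CSV columns documented below (with booleans serialized as True/False), the summary can be recomputed like this:

    # Recompute the summary metrics from the benchmark CSV.
    import csv

    with open("math_results.csv") as f:
        rows = list(csv.DictReader(f))

    n = len(rows)
    avg_latency = sum(float(r["latency"]) for r in rows) / n
    tool_acc = sum(r["tool_selection_accuracy"] == "True" for r in rows) / n
    answer_acc = sum(r["correct_ans"] == "True" for r in rows) / n

    print(f"Total Tasks              : {n}")
    print(f"Avg Latency S            : {avg_latency}")
    print(f"Tool Selection Accuracy  : {tool_acc}")
    print(f"Answer Accuracy          : {answer_acc}")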

Raw results:

| id | question | used_tool | tool_ground_truth | tool_selection_accuracy | args | result | expected | correct_ans | latency | input_tokens | output_tokens | total_tokens | error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | What is 12 / 5? | divide | divide | True | {'a': 12, 'b': 5} | 2.4 | 2.4 | True | 4.009 | | | | |
| 2 | Compute 7 * 8 | multiply | multiply | True | {'a': 7, 'b': 8} | 56 | 56 | True | 2.492 | | | | |
| 3 | Calculate 15 + 27 | add | add | True | {'a': 15, 'b': 27} | 42 | 42 | True | 2.401 | | | | |
| 4 | What is 100 - 35? | subtract | subtract | True | {'a': 100, 'b': 35} | 65 | 65 | True | 2.323 | | | | |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 22 | Calculate the volume of a cube with side length 4 | volume | volume | True | {'shape': 'cube', 'kwargs': 'side=4'} | Error: ToolException('Error executing tool volume: Cube requires side length') Please fix your mistakes. | 64 | False | 15.960 | | | | |

Detailed logging

When benchmark mode is on, a detailed trace is logged for each task:

Processing request of type ListToolsRequest
▶ Task 1: What is 12 / 5?
Processing request of type CallToolRequest
--> Latency: 3.168s
--> Tool call: divide({'a': 12, 'b': 5})
--> Raw tool result: 2.4
--> Tokens: input=None, output=None, total=None
--> Expected: 2.4 → ✅ PASS

Once done, open agents/math_results.csv to review the per-task results; the column schema is described below.

πŸ“ CSV Output Columns

| Column                  | Description                                        |
|-------------------------|----------------------------------------------------|
| id                      | Task identifier                                    |
| question                | The natural-language math problem                  |
| used_tool               | Name of the tool the agent invoked                 |
| tool_ground_truth       | The correct tool that should have been called      |
| tool_selection_accuracy | true if used_tool == tool_ground_truth, else false |
| args                    | Dictionary of arguments passed to the tool         |
| result                  | Raw output returned by the tool                    |
| expected                | Gold-standard answer                               |
| correct_ans             | true if result == expected (with float tolerance)  |
| latency                 | Round-trip time in seconds                         |
| input_tokens            | Number of tokens in the prompt sent to the LLM     |
| output_tokens           | Number of tokens in the LLM’s response             |
| total_tokens            | Sum of input and output tokens                     |
| error                   | Error message, if a call failed                    |
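
The float tolerance behind correct_ans can be implemented with math.isclose; the tolerance value below is an assumption, since benchmark_utils.py may use a different one:

    import math

    def answers_match(result: float, expected: float, rel_tol: float = 1e-6) -> bool:
        # Treat near-equal floats (e.g. 2.4000000000000004 vs 2.4) as correct.
        return math.isclose(result, expected, rel_tol=rel_tol)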

🔧 Extending MCPBench

  • Add new tasks: edit math_env/tasks.json; each entry needs id, question, expected, and tool (see the example entry after this list).
  • Few-shot data: place JSON examples in few_shot_data/ and update globals.py so that FEW_SHOT_FPATH points at them.
  • Custom tools: in servers/math_mcp_server.py, decorate new functions with @mcp.tool() and restart the server.
  • Swap LLMs: pass any --model-id supported by your provider (e.g. gpt-4o, claude-3-7-sonnet, mistral-large-2407).
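
For example, a new entry in math_env/tasks.json could look like the following (the values are illustrative; only the four keys listed above are required):

    {
      "id": 23,
      "question": "What is 144 / 12?",
      "expected": 12,
      "tool": "divide"
    }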

📄 License

This project is released under the MIT License. See the LICENSE file for details.
