# MCPBench: Benchmarking Math MCP Tool Selection

MCPBench is a lightweight framework for benchmarking AI agents' tool-selection accuracy and end-to-end performance on a local MCP math server. It provides:
- A simple MCP math server (`math_mcp_server.py`) exposing basic arithmetic tools.
- A LangGraph-based ReAct-style agent (`langgraph_local_mcp_client.py`) using Bedrock or any chat LLM.
- A benchmark mode that runs a suite of math tasks, logs detailed metrics, and outputs a CSV summary.
## Prerequisites

1. Install uv (Astral's CLI) and create a Python venv:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   export PATH="$HOME/.local/bin:$PATH"
   uv venv && source .venv/bin/activate && uv pip sync pyproject.toml
   export UV_PROJECT_ENVIRONMENT=.venv
   uv add zmq
   python -m ipykernel install --user --name=.venv --display-name="Python (uv env)"
   ```

2. Set up AWS credentials (if you plan to use Bedrock as your LLM):

   ```bash
   export AWS_PROFILE=your-aws-profile
   export AWS_REGION=us-west-2
   export BEDROCK_MODEL_ID=us.amazon.nova-lite-v1:0
   ```
## Project Structure

```
mcp-bench/
├── agents/
│   ├── langgraph_local_mcp_client.py  # ReAct-style client + benchmark driver
│   ├── math_results.csv               # (auto-generated) per-task benchmark output
│   └── benchmark_summary.txt          # (auto-generated) aggregated metrics summary
├── eval_functions/
│   └── benchmark_utils.py             # run_single() + write_summary() helpers
├── servers/
│   └── math_mcp_server.py             # MCP math tool server
├── math_env/
│   └── tasks.json                     # Benchmark task definitions (id, question, expected, tool)
├── few_shot_data/                     # Optional few-shot examples for in-prompt priming
├── globals.py                         # Paths & constants (e.g. server FPATHs, FEW_SHOT_FPATH)
├── pyproject.toml                     # Project dependencies
├── uv.lock                            # uv environment lockfile
└── README.md                          # This file: setup, usage, CSV schema, extension notes
```
## Running the MCP Server

1. Open a terminal and navigate to the servers folder:

   ```bash
   cd servers
   ```

2. Launch the math MCP server:

   ```bash
   python math_mcp_server.py
   ```

The server exposes all `@mcp.tool()` functions (`add`, `subtract`, `multiply`, `divide`, etc.) over stdio.
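As a rough sketch of what such a server looks like (this is illustrative, not the repository's exact code, and assumes the official `mcp` Python SDK with its FastMCP helper is installed):

```python
# Hypothetical sketch of an MCP math tool server.
# Assumes the official `mcp` Python SDK (FastMCP); not the exact
# implementation of servers/math_mcp_server.py.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("math")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@mcp.tool()
def divide(a: float, b: float) -> float:
    """Divide a by b."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

if __name__ == "__main__":
    # Serve the registered tools over stdio, which is how the
    # local client connects.
    mcp.run(transport="stdio")
```

Each `@mcp.tool()` function's signature and docstring become the tool schema the agent sees when it lists available tools.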
## Interactive Client Mode

1. In a new terminal, activate your uv venv and navigate to `agents`:

   ```bash
   cd agents
   ```

2. Run the client in REPL mode:

   ```bash
   uv run langgraph_local_mcp_client.py
   ```

3. Ask questions interactively:

   ```
   >>> What is 7 + 8?
   15
   >>> Calculate 6 * 9
   54
   >>> quit
   Goodbye!
   ```
## Benchmark Mode

To execute the full task suite and generate a CSV report:

```bash
uv run langgraph_local_mcp_client.py \
  --benchmark \
  --output math_results.csv \
  --model-id $BEDROCK_MODEL_ID \
  --recursions 30
```

- `--benchmark`: run in batch mode
- `--output`: path to write the CSV
- `--model-id`: specify any supported chat LLM model
- `--recursions`: adjust the agent's graph recursion limit
### Example results

Results on the tool-calling metrics using Amazon Nova Lite:

```
MCPBench Summary
----------------
Total Tasks             : 22
Avg Latency S           : 4.226772727272728
Tool Selection Accuracy : 0.7727272727272727
Answer Accuracy         : 0.7727272727272727
```
Raw results:

| id | question | used_tool | tool_ground_truth | tool_selection_accuracy | args | result | expected | correct_ans | latency | input_tokens | output_tokens | total_tokens | error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | What is 12 / 5? | divide | divide | True | {'a': 12, 'b': 5} | 2.4 | 2.4 | True | 4.009 | | | | |
| 2 | Compute 7 * 8 | multiply | multiply | True | {'a': 7, 'b': 8} | 56 | 56 | True | 2.492 | | | | |
| 3 | Calculate 15 + 27 | add | add | True | {'a': 15, 'b': 27} | 42 | 42 | True | 2.401 | | | | |
| 4 | What is 100 - 35? | subtract | subtract | True | {'a': 100, 'b': 35} | 65 | 65 | True | 2.323 | | | | |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 22 | Calculate the volume of a cube with side length 4 | volume | volume | True | {'shape': 'cube', 'kwargs': 'side=4'} | Error: ToolException('Error executing tool volume: Cube requires side length') Please fix your mistakes. | 64 | False | 15.960 | | | | |
### Detailed logging

When benchmark mode is on, you will see detailed traces of each task:

```
Processing request of type ListToolsRequest
▶ Task 1: What is 12 / 5?
Processing request of type CallToolRequest
--> Latency: 3.168s
--> Tool call: divide({'a': 12, 'b': 5})
--> Raw tool result: 2.4
--> Tokens: input=None, output=None, total=None
--> Expected: 2.4 → ✅ PASS
```
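The per-task flow in the trace (call a tool, time it, compare against the expected answer with a float tolerance) can be sketched as follows. This `run_task` helper and its return shape are purely illustrative, not the actual `run_single()` from `eval_functions/benchmark_utils.py`:

```python
import math
import time

def run_task(call_tool, task):
    """Illustrative per-task runner: times one tool call and scores it.

    `call_tool` is any callable mapping a question string to a
    (tool_name, result) pair; `task` is a dict with 'question',
    'expected', and 'tool' keys, mirroring tasks.json.
    """
    start = time.perf_counter()
    used_tool, result = call_tool(task["question"])
    latency = time.perf_counter() - start
    return {
        "used_tool": used_tool,
        "tool_selection_accuracy": used_tool == task["tool"],
        # Numeric comparison with a small tolerance, mirroring the
        # float-tolerant answer check described in the CSV schema.
        "correct_ans": math.isclose(float(result), float(task["expected"]),
                                    rel_tol=1e-6),
        "latency": latency,
    }

# Example with a stub "agent" that always calls divide:
row = run_task(lambda q: ("divide", 12 / 5),
               {"question": "What is 12 / 5?", "expected": 2.4, "tool": "divide"})
print(row["tool_selection_accuracy"], row["correct_ans"])  # True True
```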
Once done, open `agents/math_results.csv` to review the per-task results.
## CSV Output Columns

| Column | Description |
|---|---|
| id | Task identifier |
| question | The natural-language math problem |
| used_tool | Name of the tool the agent invoked |
| tool_ground_truth | The correct tool that should have been called |
| tool_selection_accuracy | `true` if `used_tool == tool_ground_truth`, else `false` |
| args | Dictionary of arguments passed to the tool |
| result | Raw output returned by the tool |
| expected | Gold-standard answer |
| correct_ans | `true` if `result == expected` (with float tolerance) |
| latency | Round-trip time in seconds |
| input_tokens | Number of tokens in the prompt sent to the LLM |
| output_tokens | Number of tokens in the LLM's response |
| total_tokens | Sum of input and output tokens |
| error | Any error message if a call failed |
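The summary metrics can be recomputed from this CSV with the standard library alone. A minimal sketch (column names taken from the schema above; the two sample rows are invented for illustration and stand in for `agents/math_results.csv`):

```python
import csv
import io

# Two invented rows using the CSV schema above.
sample = io.StringIO(
    "id,used_tool,tool_ground_truth,correct_ans,latency\n"
    "1,divide,divide,True,4.009\n"
    "2,multiply,add,False,2.492\n"
)

rows = list(csv.DictReader(sample))
n = len(rows)
tool_acc = sum(r["used_tool"] == r["tool_ground_truth"] for r in rows) / n
answer_acc = sum(r["correct_ans"] == "True" for r in rows) / n
avg_latency = sum(float(r["latency"]) for r in rows) / n

print(f"Tool Selection Accuracy : {tool_acc:.2f}")    # 0.50
print(f"Answer Accuracy         : {answer_acc:.2f}")  # 0.50
print(f"Avg Latency S           : {avg_latency:.3f}")
```

To analyze a real run, replace `sample` with `open("agents/math_results.csv")`.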
## Extending MCPBench

- **Add new tasks**: edit `math_env/tasks.json` (each entry needs `id`, `question`, `expected`, `tool`).
- **Few-shot data**: place JSON examples in `few_shot_data/` and update `FEW_SHOT_FPATH` in `globals.py` to point to them.
- **Custom tools**: in `servers/math_mcp_server.py`, decorate new functions with `@mcp.tool()` and restart the server.
- **Swap LLMs**: pass any `--model-id` supported by your provider (e.g. `gpt-4o`, `claude-3-7-sonnet`, `mistral-large-2407`).
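For example, a new entry appended to `math_env/tasks.json` might look like the following. The field values here are hypothetical; only the required keys (`id`, `question`, `expected`, `tool`) come from the list above:

```python
import json

# Hypothetical new task entry; the required keys match the list above.
new_task = json.loads("""
{
  "id": 23,
  "question": "What is 9 * 12?",
  "expected": 108,
  "tool": "multiply"
}
""")

# Sanity-check that all required keys are present before appending
# the entry to tasks.json.
required = {"id", "question", "expected", "tool"}
assert required <= new_task.keys()
print(new_task["tool"])  # multiply
```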
## License

This project is released under the MIT License.