Spark History Server MCP Proxy

A stdio-based MCP server that exposes Spark History Server data through 4 standardized MCP tools. This server connects MCP clients directly to Spark History Server REST APIs using the Model Context Protocol.

🚀 What It Does

Provides 4 MCP tools that query your Spark History Server (an example call follows the list):

  1. get_applications - List Spark applications with optional filtering
  2. get_application_info - Get detailed application information
  3. get_application_jobs - Get jobs for a specific application
  4. get_application_stages - Get stages for a specific application
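
Once connected, an MCP client invokes each tool with a standard JSON-RPC tools/call request. A sketch against this server (the argument values are illustrative; names follow the tool table below):

echo '{"jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": {"name": "get_applications", "arguments": {"status": "completed", "limit": 5}}}' | \
./env/bin/python3 src/main.py --config config.json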

🏗️ Architecture

┌─────────────────┐     stdio/MCP     ┌──────────────────┐     HTTP     ┌──────────────────┐
│   MCP Client    │ ───────────────── │   MCP Server     │ ──────────── │ Spark History    │
│                 │                   │   (This Project) │              │ Server :18080    │
└─────────────────┘                   └──────────────────┘              └──────────────────┘

Important: This uses the stdio-based MCP protocol rather than HTTP endpoints, meaning (see the sketch after this list):

  • Communication via stdin/stdout, not network ports
  • Uses JSON-RPC 2.0 for message formatting
  • Requires the official mcp>=1.13.0 package
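
For reference, this is roughly how a stdio MCP server is wired up with the official SDK. This is a minimal sketch, not this project's actual source; the tool schema is illustrative:

import asyncio

import mcp.types as types
from mcp.server import Server
from mcp.server.stdio import stdio_server

server = Server("spark-history-server")

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    # Advertise one tool; the real server registers all four.
    return [types.Tool(
        name="get_applications",
        description="List Spark applications with optional filtering",
        inputSchema={"type": "object", "properties": {
            "status": {"type": "string"},
            "limit": {"type": "integer"},
        }},
    )]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    # A real implementation would proxy the request to the History Server REST API.
    return [types.TextContent(type="text", text=f"called {name} with {arguments}")]

async def main():
    # stdio_server exposes stdin/stdout as the JSON-RPC transport.
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())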

✨ Current Features

  • ✅ Standard MCP Protocol: Uses official MCP SDK with stdio transport
  • ✅ Direct History Server Access: Pure proxy to Spark History Server REST API
  • ✅ 4 Core Tools: Essential application, job, and stage data access
  • ✅ Stateless Operation: No local storage or data persistence required
  • ✅ Simple Configuration: Just a History Server URL needed

🛠️ Available MCP Tools

| Tool | Description | Parameters | Status |
|------|-------------|------------|--------|
| get_applications | List all applications | status (optional), limit (optional) | Implemented |
| get_application_info | Get application details | app_id (required) | Implemented |
| get_application_jobs | Get application jobs | app_id (required) | Implemented |
| get_application_stages | Get application stages | app_id (required), status (optional) | Implemented |

🚧 Not Implemented Yet

These tools are planned for future implementation:

| Tool | Description | Status |
|------|-------------|--------|
| get_application_executors | Get executor information | 🚧 Planned |
| get_application_environment | Get environment details | 🚧 Planned |
| get_job_info | Get specific job details | 🚧 Planned |
| get_stage_info | Get specific stage details | 🚧 Planned |
| get_stage_tasks | Get stage task details | 🚧 Planned |
| get_rdd_storage | Get RDD storage info | 🚧 Planned |
| get_sql_queries | Get SQL execution data | 🚧 Planned |
| get_streaming_batches | Get streaming batch data | 🚧 Planned |

💻 Quick Start

Prerequisites

  • Spark History Server running and accessible (typically on port 18080)
  • Python 3.8+
  • MCP-compatible client

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd spark-mcp-server
    
  2. Install dependencies:

    # The bundled virtual environment (env/) already includes all dependencies.
    # To reinstall them if needed:
    ./env/bin/python3 -m pip install -r requirements.txt
    
  3. Start Spark History Server:

    ./start_history_server.sh
    

    This will start the History Server on http://localhost:18080

⚠️ Important - Event Logs Directory: The Spark History Server reads event logs from the /tmp/spark-events/ directory. For the server to show data:

  • All Spark applications must write event logs to this directory
  • Set spark.eventLog.dir=/tmp/spark-events in your Spark configuration
  • Or use the environment variable: export SPARK_EVENTLOG_DIR=/tmp/spark-events
  • Ensure this directory exists and is accessible (see the command below)
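
For a local setup, create the directory before running anything so the History Server has something to read:

mkdir -p /tmp/spark-events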

MCP Client Setup

Configure your MCP client to launch this server. Because communication happens over stdio, the client spawns the server process itself using this configuration:

"spark-history-server": {
  "command": "/path/to/spark-mcp-server/env/bin/python3",
  "source": "custom",
  "args": [
    "/path/to/spark-mcp-server/src/main.py",
    "--config",
    "/path/to/spark-mcp-server/config.json"
  ],
  "env": {
    "PYTHONPATH": "/path/to/spark-mcp-server"
  }
}
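
The top-level shape depends on the client. For clients that follow the Claude Desktop convention, the same block nests under an "mcpServers" key (a sketch with the same placeholder paths):

{
  "mcpServers": {
    "spark-history-server": {
      "command": "/path/to/spark-mcp-server/env/bin/python3",
      "args": [
        "/path/to/spark-mcp-server/src/main.py",
        "--config",
        "/path/to/spark-mcp-server/config.json"
      ],
      "env": {
        "PYTHONPATH": "/path/to/spark-mcp-server"
      }
    }
  }
}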

⚠️ Important:

  • Update the paths to match your actual installation directory
  • The server uses the stdio protocol, not HTTP endpoints
  • Requires MCP SDK (mcp>=1.13.0) to be installed

🧪 Testing

Generate Sample Data

  1. Run sample Spark application:

    python3 test-files/sample_spark_app.py
    

    Note: The sample app is configured to write event logs to /tmp/spark-events/ which matches the History Server configuration.

  2. Start History Server:

    ./start_history_server.sh
    
  3. Verify data is available:

    curl "http://localhost:18080/api/v1/applications?limit=3"
    

💡 For Your Own Spark Applications: To make your Spark applications visible in the History Server, ensure they write event logs to the same directory:

# Using spark-submit
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events \
  your_app.py

# Using environment variable
export SPARK_EVENTLOG_DIR=/tmp/spark-events
spark-submit --conf spark.eventLog.enabled=true your_app.py

# In PySpark code
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .config("spark.eventLog.enabled", "true") \
  .config("spark.eventLog.dir", "/tmp/spark-events") \
  .getOrCreate()

Test MCP Server

# Test version
./env/bin/python3 src/main.py --version

# Test MCP protocol (initialize message)
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | \
./env/bin/python3 src/main.py --config config.json

# Test tools listing
echo '{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}' | \
./env/bin/python3 src/main.py --config config.json
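
The MCP Python SDK's stdio client can exercise the server end-to-end as well. A sketch, assuming it is run from the repository root:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a subprocess and speak MCP over its stdin/stdout.
    params = StdioServerParameters(
        command="./env/bin/python3",
        args=["src/main.py", "--config", "config.json"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("tools:", [tool.name for tool in tools.tools])
            result = await session.call_tool("get_applications", {"limit": 3})
            print(result.content)

asyncio.run(main())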

📊 Response Examples

Applications List

{
  "success": true,
  "data": [
    {
      "id": "local-1755324061532",
      "name": "MCP-Test-Sample-Application",
      "attempts": [{
        "startTime": "2025-08-16T06:01:01.024GMT",
        "endTime": "2025-08-16T06:01:45.732GMT",
        "completed": true,
        "sparkUser": "username",
        "appSparkVersion": "3.3.1"
      }]
    }
  ],
  "count": 1,
  "message": "Retrieved 1 applications"
}

Application Jobs

{
  "success": true,
  "data": [
    {
      "jobId": 0,
      "name": "count at NativeMethodAccessorImpl.java:0",
      "status": "SUCCEEDED",
      "numTasks": 8,
      "numCompletedTasks": 8,
      "submissionTime": "2025-08-16T06:01:03.732GMT",
      "completionTime": "2025-08-16T06:01:04.752GMT"
    }
  ],
  "count": 34,
  "message": "Retrieved 34 jobs for application local-1755324061532"
}
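
Because responses are plain JSON, downstream tooling can post-process them directly. For example, summing job wall-clock time from a get_application_jobs response (a sketch based on the payload shape above):

from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # History Server timestamps look like "2025-08-16T06:01:03.732GMT".
    return datetime.strptime(ts.replace("GMT", ""), "%Y-%m-%dT%H:%M:%S.%f")

def total_job_seconds(response: dict) -> float:
    """Sum wall-clock duration across completed jobs in a get_application_jobs response."""
    return sum(
        (parse_ts(job["completionTime"]) - parse_ts(job["submissionTime"])).total_seconds()
        for job in response["data"]
        if "completionTime" in job
    )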

📁 Project Structure

spark-mcp-server/
├── src/
│   ├── main.py              # MCP server entry point (stdio-based)
│   ├── history_client.py    # Spark History Server HTTP client
│   └── mcp_server.py        # Original implementation (unused)
├── config.json              # Server configuration
├── start_history_server.sh  # History Server startup script
├── requirements.txt         # Python dependencies
├── env/                     # Virtual environment (with MCP SDK)
├── test-files/              # Development & test files
└── README.md                # This file

⚙️ Configuration Options

Basic Configuration

{
  "spark_history_server": {
    "url": "http://localhost:18080"
  },
  "logging": {
    "level": "INFO",
    "console": true
  }
}

With Authentication

{
  "spark_history_server": {
    "url": "https://spark-history.company.com:18080",
    "auth": {
      "type": "basic",
      "username": "spark_user",
      "password": "${SPARK_PASSWORD}"
    }
  },
  "logging": {
    "level": "INFO",
    "console": true,
    "file": "/var/log/spark-mcp.log"
  }
}
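
The ${SPARK_PASSWORD} placeholder implies environment-variable substitution when the config is loaded. One way to implement that pattern in Python (a sketch, not necessarily this project's code):

import json
import os
import re

def load_config(path: str) -> dict:
    """Load a JSON config, replacing ${VAR} placeholders with environment values."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Leave a placeholder intact if the variable is unset, so the error surfaces clearly.
    expanded = re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), m.group(0)), raw)
    return json.loads(expanded)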

🎯 Use Cases

  • Performance Analysis: Query job/stage execution times and resource usage
  • Monitoring Integration: Feed Spark metrics into monitoring dashboards
  • Development Tools: IDE integration for Spark application monitoring
  • CI/CD Pipelines: Automated Spark job status checking
  • Data Engineering: Programmatic access to Spark execution metadata

🚧 Current Limitations

  • Limited Tools: Only 4 of the planned 15+ tools are implemented
  • Basic Error Handling: Minimal error handling and retry logic
  • Basic Authentication: Implemented but needs additional testing
  • Single History Server: No support for multiple History Server instances
  • Requires Setup: Client must have correct paths configured

📜 License

Apache License 2.0 - see LICENSE file for details.

🐛 Support

  • Check test-files/ directory for examples and troubleshooting
  • Review logs when running with "level": "DEBUG" in the config (example after this list)
  • Ensure Spark History Server is accessible at the configured URL
  • See MCP_CONFIGURATION_FIX.md for detailed troubleshooting information
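
For example, a debug variant of the Basic Configuration above:

{
  "spark_history_server": {
    "url": "http://localhost:18080"
  },
  "logging": {
    "level": "DEBUG",
    "console": true
  }
}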

Current Status: ✅ Working MCP Server with 4 core tools implemented.