Spark History Server MCP Proxy
A stdio-based MCP server that exposes Spark History Server data through 4 standardized MCP tools. This server connects MCP clients directly to Spark History Server REST APIs using the Model Context Protocol.
🚀 What It Does
Provides 4 MCP tools that query your Spark History Server:
- get_applications - List Spark applications with optional filtering
- get_application_info - Get detailed application information
- get_application_jobs - Get jobs for a specific application
- get_application_stages - Get stages for a specific application
🏗️ Architecture
```
┌─────────────────┐  stdio/MCP   ┌──────────────────┐    HTTP     ┌──────────────────┐
│   MCP Client    │ ──────────── │    MCP Server    │ ─────────── │  Spark History   │
│                 │              │  (This Project)  │             │  Server :18080   │
└─────────────────┘              └──────────────────┘             └──────────────────┘
```
Important: This uses stdio-based MCP protocol (not HTTP endpoints), meaning:
- Communication via stdin/stdout, not network ports
- Uses JSON-RPC 2.0 for message formatting
- Requires the official `mcp>=1.13.0` package
✨ Current Features
- ✅ Standard MCP Protocol: Uses official MCP SDK with stdio transport
- ✅ Direct History Server Access: Pure proxy to Spark History Server REST API
- ✅ 4 Core Tools: Essential application, job, and stage data access
- ✅ Stateless Operation: No local storage or data persistence required
- ✅ Simple Configuration: Just a History Server URL needed
🛠️ Available MCP Tools
| Tool | Description | Parameters | Status |
|---|---|---|---|
| `get_applications` | List all applications | `status` (optional), `limit` (optional) | ✅ Implemented |
| `get_application_info` | Get application details | `app_id` (required) | ✅ Implemented |
| `get_application_jobs` | Get application jobs | `app_id` (required) | ✅ Implemented |
| `get_application_stages` | Get application stages | `app_id` (required), `status` (optional) | ✅ Implemented |
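Clients invoke these tools with JSON-RPC 2.0 `tools/call` messages over stdio. A minimal sketch of building such a request — the tool and argument names come from the table above, the framing from the MCP spec, and the exact `status` values accepted by the History Server are an assumption:

```python
import json

def tools_call_request(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }

# List completed applications, capped at 10 results.
req = tools_call_request(1, "get_applications", {"status": "completed", "limit": 10})
print(json.dumps(req))
```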
🚧 Not Implemented Yet
These tools are planned for future implementation:
| Tool | Description | Status |
|---|---|---|
| `get_application_executors` | Get executor information | 🚧 Planned |
| `get_application_environment` | Get environment details | 🚧 Planned |
| `get_job_info` | Get specific job details | 🚧 Planned |
| `get_stage_info` | Get specific stage details | 🚧 Planned |
| `get_stage_tasks` | Get stage task details | 🚧 Planned |
| `get_rdd_storage` | Get RDD storage info | 🚧 Planned |
| `get_sql_queries` | Get SQL execution data | 🚧 Planned |
| `get_streaming_batches` | Get streaming batch data | 🚧 Planned |
💻 Quick Start
Prerequisites
- Spark History Server running and accessible (typically on port 18080)
- Python 3.8+
- MCP-compatible client
Installation
1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd spark-mcp-server
   ```
2. Install dependencies:
   ```bash
   # Dependencies are already installed in the virtual environment.
   # If you need to reinstall:
   ./env/bin/python3 -m pip install -r requirements.txt
   ```
3. Start Spark History Server:
   ```bash
   ./start_history_server.sh
   ```
   This will start the History Server on http://localhost:18080.
⚠️ Important - Event Logs Directory:
The Spark History Server reads event logs from the `/tmp/spark-events/` directory. For the server to show data:
- All Spark applications must write event logs to this directory
- Set `spark.eventLog.dir=/tmp/spark-events` in your Spark configuration, or use the environment variable `export SPARK_EVENTLOG_DIR=/tmp/spark-events`
- Ensure this directory exists and is accessible
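The directory check in the last bullet can be scripted. A minimal sketch — the path is the default from this README, and the helper name is mine, not part of this project:

```python
import os

def ensure_event_log_dir(path="/tmp/spark-events"):
    """Create the Spark event-log directory if missing and verify access."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.R_OK | os.W_OK):
        raise PermissionError(f"Event-log directory is not readable/writable: {path}")
    return path

print(ensure_event_log_dir())
```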
MCP Client Setup
Configure your MCP client to use this server. This setup is essential for proper MCP stdio protocol communication:
```json
"spark-history-server": {
  "command": "/path/to/spark-mcp-server/env/bin/python3",
  "source": "custom",
  "args": [
    "/path/to/spark-mcp-server/src/main.py",
    "--config",
    "/path/to/spark-mcp-server/config.json"
  ],
  "env": {
    "PYTHONPATH": "/path/to/spark-mcp-server"
  }
}
```
⚠️ Important:
- Update the paths to match your actual installation directory
- The server uses stdio protocol, not HTTP endpoints
- Requires the MCP SDK (`mcp>=1.13.0`) to be installed
🧪 Testing
Generate Sample Data
1. Run the sample Spark application:
   ```bash
   python3 test-files/sample_spark_app.py
   ```
   Note: The sample app is configured to write event logs to `/tmp/spark-events/`, which matches the History Server configuration.
2. Start the History Server:
   ```bash
   ./start_history_server.sh
   ```
3. Verify data is available:
   ```bash
   curl "http://localhost:18080/api/v1/applications?limit=3"
   ```
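The same check can be done from Python using only the standard library. A minimal sketch — the response shape follows the examples later in this README, and the helper names are mine:

```python
import json
import urllib.request

def summarize_applications(apps):
    """Reduce a History Server application list to (id, name, completed) tuples."""
    return [
        (a["id"], a["name"],
         a["attempts"][-1]["completed"] if a.get("attempts") else None)
        for a in apps
    ]

def fetch_applications(base_url="http://localhost:18080", limit=3):
    """Query the History Server REST API for recent applications."""
    with urllib.request.urlopen(f"{base_url}/api/v1/applications?limit={limit}") as resp:
        return summarize_applications(json.load(resp))

# Offline demonstration with the sample payload from this README:
sample = [{"id": "local-1755324061532",
           "name": "MCP-Test-Sample-Application",
           "attempts": [{"completed": True}]}]
print(summarize_applications(sample))
```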
💡 For Your Own Spark Applications: To make your Spark applications visible in the History Server, ensure they write event logs to the same directory:
```bash
# Using spark-submit
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events \
  your_app.py

# Using an environment variable
export SPARK_EVENTLOG_DIR=/tmp/spark-events
spark-submit --conf spark.eventLog.enabled=true your_app.py
```

```python
# In PySpark code
spark = SparkSession.builder \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "/tmp/spark-events") \
    .getOrCreate()
```
Test MCP Server
```bash
# Test version
./env/bin/python3 src/main.py --version

# Test MCP protocol (initialize message)
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | \
  ./env/bin/python3 src/main.py --config config.json

# Test tools listing
echo '{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}' | \
  ./env/bin/python3 src/main.py --config config.json
```
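The same handshake can be driven from Python instead of `echo`. A minimal sketch that only builds the two messages shown above — spawning the server via `subprocess` is left out so the snippet runs anywhere:

```python
import json

def initialize_request(request_id=1):
    """JSON-RPC 2.0 `initialize` message, mirroring the echo example above."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "test", "version": "1.0"},
        },
    }

def tools_list_request(request_id=2):
    """JSON-RPC 2.0 `tools/list` message."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}

# One message per line on the server's stdin.
stdin_payload = "\n".join(json.dumps(m) for m in (initialize_request(), tools_list_request()))
print(stdin_payload)
```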
📊 Response Examples
Applications List
```json
{
  "success": true,
  "data": [
    {
      "id": "local-1755324061532",
      "name": "MCP-Test-Sample-Application",
      "attempts": [{
        "startTime": "2025-08-16T06:01:01.024GMT",
        "endTime": "2025-08-16T06:01:45.732GMT",
        "completed": true,
        "sparkUser": "username",
        "appSparkVersion": "3.3.1"
      }]
    }
  ],
  "count": 1,
  "message": "Retrieved 1 applications"
}
```
Application Jobs
```json
{
  "success": true,
  "data": [
    {
      "jobId": 0,
      "name": "count at NativeMethodAccessorImpl.java:0",
      "status": "SUCCEEDED",
      "numTasks": 8,
      "numCompletedTasks": 8,
      "submissionTime": "2025-08-16T06:01:03.732GMT",
      "completionTime": "2025-08-16T06:01:04.752GMT"
    }
  ],
  "count": 34,
  "message": "Retrieved 34 jobs for application local-1755324061532"
}
```
📁 Project Structure
```
spark-mcp-server/
├── src/
│   ├── main.py               # MCP server entry point (stdio-based)
│   ├── history_client.py     # Spark History Server HTTP client
│   └── mcp_server.py         # Original implementation (unused)
├── config.json               # Server configuration
├── start_history_server.sh   # History Server startup script
├── requirements.txt          # Python dependencies
├── env/                      # Virtual environment (with MCP SDK)
├── test-files/               # Development & test files
└── README.md                 # This file
```
⚙️ Configuration Options
Basic Configuration
```json
{
  "spark_history_server": {
    "url": "http://localhost:18080"
  },
  "logging": {
    "level": "INFO",
    "console": true
  }
}
```
With Authentication
```json
{
  "spark_history_server": {
    "url": "https://spark-history.company.com:18080",
    "auth": {
      "type": "basic",
      "username": "spark_user",
      "password": "${SPARK_PASSWORD}"
    }
  },
  "logging": {
    "level": "INFO",
    "console": true,
    "file": "/var/log/spark-mcp.log"
  }
}
```
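The `${SPARK_PASSWORD}` placeholder above suggests environment-variable substitution. How this server resolves it is not shown here, but a sketch of one common approach using `string.Template` — the function name is mine, not part of this project:

```python
import json
import os
from string import Template

def resolve_env_placeholders(config_text, env=None):
    """Substitute ${VAR} placeholders in a raw config string from the environment."""
    return Template(config_text).safe_substitute(env or os.environ)

raw = '{"auth": {"type": "basic", "username": "spark_user", "password": "${SPARK_PASSWORD}"}}'
config = json.loads(resolve_env_placeholders(raw, {"SPARK_PASSWORD": "s3cret"}))
print(config["auth"]["password"])  # s3cret
```

`safe_substitute` leaves unknown placeholders untouched instead of raising, which keeps a config loadable even when an optional variable is unset.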
🎯 Use Cases
- Performance Analysis: Query job/stage execution times and resource usage
- Monitoring Integration: Feed Spark metrics into monitoring dashboards
- Development Tools: IDE integration for Spark application monitoring
- CI/CD Pipelines: Automated Spark job status checking
- Data Engineering: Programmatic access to Spark execution metadata
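For the CI/CD use case, a pipeline step might fail the build when any job in a run did not succeed. A sketch over the `get_application_jobs` response shape shown earlier — the helper name is mine:

```python
def all_jobs_succeeded(jobs):
    """True when every job in a History Server jobs list finished with SUCCEEDED."""
    return all(job.get("status") == "SUCCEEDED" for job in jobs)

jobs = [
    {"jobId": 0, "status": "SUCCEEDED"},
    {"jobId": 1, "status": "FAILED"},
]
print(all_jobs_succeeded(jobs))  # False
```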
🚧 Current Limitations
- Limited Tools: Only the 4 core tools are implemented; the remaining planned tools listed above are not yet available
- Basic Error Handling: Minimal error handling and retry logic
- Basic Authentication: Implemented but needs additional testing
- Single History Server: No support for multiple History Server instances
- Requires Setup: Client must have correct paths configured
📜 License
Apache License 2.0 - see LICENSE file for details.
🐛 Support
- Check the `test-files/` directory for examples and troubleshooting
- Review logs when running with `"level": "DEBUG"` in the config
- Ensure the Spark History Server is accessible at the configured URL
- See `MCP_CONFIGURATION_FIX.md` for detailed troubleshooting information
Current Status: ✅ Working MCP Server with 4 core tools implemented.