Spark History Server MCP Proxy

A stdio-based MCP server that exposes Spark History Server data through 4 standardized MCP tools. This server connects MCP clients directly to Spark History Server REST APIs using the Model Context Protocol.

🚀 What It Does

Provides 4 MCP tools that query your Spark History Server (an example call follows the list):

  1. get_applications - List Spark applications with optional filtering
  2. get_application_info - Get detailed application information
  3. get_application_jobs - Get jobs for a specific application
  4. get_application_stages - Get stages for a specific application
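
Once connected, an MCP client invokes each tool with a standard JSON-RPC tools/call request. A sketch against this server (the argument values are illustrative; names follow the tool table below):

echo '{"jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": {"name": "get_applications", "arguments": {"status": "completed", "limit": 5}}}' | \
./env/bin/python3 src/main.py --config config.json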

🏗️ Architecture

┌─────────────────┐     stdio/MCP     ┌──────────────────┐     HTTP     ┌──────────────────┐
│   MCP Client    │ ───────────────── │   MCP Server     │ ──────────── │ Spark History    │
│                 │                   │   (This Project) │              │ Server :18080    │
└─────────────────┘                   └──────────────────┘              └──────────────────┘

Important: This uses the stdio-based MCP protocol rather than HTTP endpoints, meaning (see the sketch after this list):

  • Communication via stdin/stdout, not network ports
  • Uses JSON-RPC 2.0 for message formatting
  • Requires the official mcp>=1.13.0 package
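
For reference, this is roughly how a stdio MCP server is wired up with the official SDK. This is a minimal sketch, not this project's actual source; the tool schema is illustrative:

import asyncio

import mcp.types as types
from mcp.server import Server
from mcp.server.stdio import stdio_server

server = Server("spark-history-server")

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    # Advertise one tool; the real server registers all four.
    return [types.Tool(
        name="get_applications",
        description="List Spark applications with optional filtering",
        inputSchema={"type": "object", "properties": {
            "status": {"type": "string"},
            "limit": {"type": "integer"},
        }},
    )]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    # A real implementation would proxy the request to the History Server REST API.
    return [types.TextContent(type="text", text=f"called {name} with {arguments}")]

async def main():
    # stdio_server exposes stdin/stdout as the JSON-RPC transport.
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())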

✨ Current Features

  • ✅ Standard MCP Protocol: Uses official MCP SDK with stdio transport
  • ✅ Direct History Server Access: Pure proxy to Spark History Server REST API
  • ✅ 4 Core Tools: Essential application, job, and stage data access
  • ✅ Stateless Operation: No local storage or data persistence required
  • ✅ Simple Configuration: Just a History Server URL needed

🛠️ Available MCP Tools

| Tool | Description | Parameters | Status |
|------|-------------|------------|--------|
| get_applications | List all applications | status (optional), limit (optional) | Implemented |
| get_application_info | Get application details | app_id (required) | Implemented |
| get_application_jobs | Get application jobs | app_id (required) | Implemented |
| get_application_stages | Get application stages | app_id (required), status (optional) | Implemented |

🚧 Not Implemented Yet

These tools are planned for future implementation:

| Tool | Description | Status |
|------|-------------|--------|
| get_application_executors | Get executor information | 🚧 Planned |
| get_application_environment | Get environment details | 🚧 Planned |
| get_job_info | Get specific job details | 🚧 Planned |
| get_stage_info | Get specific stage details | 🚧 Planned |
| get_stage_tasks | Get stage task details | 🚧 Planned |
| get_rdd_storage | Get RDD storage info | 🚧 Planned |
| get_sql_queries | Get SQL execution data | 🚧 Planned |
| get_streaming_batches | Get streaming batch data | 🚧 Planned |

💻 Quick Start

Prerequisites

  • Spark History Server running and accessible (typically on port 18080)
  • Python 3.8+
  • MCP-compatible client

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd spark-mcp-server
    
  2. Install dependencies:

    # The bundled virtual environment (env/) already includes all dependencies.
    # To reinstall them if needed:
    ./env/bin/python3 -m pip install -r requirements.txt
    
  3. Start Spark History Server:

    ./start_history_server.sh
    

    This will start the History Server on http://localhost:18080

⚠️ Important - Event Logs Directory: The Spark History Server reads event logs from the /tmp/spark-events/ directory. For the server to show data:

  • All Spark applications must write event logs to this directory
  • Set spark.eventLog.dir=/tmp/spark-events in your Spark configuration
  • Or use the environment variable: export SPARK_EVENTLOG_DIR=/tmp/spark-events
  • Ensure this directory exists and is accessible (see the command below)
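
For a local setup, create the directory before running anything so the History Server has something to read:

mkdir -p /tmp/spark-events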

MCP Client Setup

Configure your MCP client to launch this server. Because communication happens over stdio, the client spawns the server process itself using this configuration:

"spark-history-server": {
  "command": "/path/to/spark-mcp-server/env/bin/python3",
  "source": "custom",
  "args": [
    "/path/to/spark-mcp-server/src/main.py",
    "--config",
    "/path/to/spark-mcp-server/config.json"
  ],
  "env": {
    "PYTHONPATH": "/path/to/spark-mcp-server"
  }
}
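
The top-level shape depends on the client. For clients that follow the Claude Desktop convention, the same block nests under an "mcpServers" key (a sketch with the same placeholder paths):

{
  "mcpServers": {
    "spark-history-server": {
      "command": "/path/to/spark-mcp-server/env/bin/python3",
      "args": [
        "/path/to/spark-mcp-server/src/main.py",
        "--config",
        "/path/to/spark-mcp-server/config.json"
      ],
      "env": {
        "PYTHONPATH": "/path/to/spark-mcp-server"
      }
    }
  }
}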

⚠️ Important:

  • Update the paths to match your actual installation directory
  • The server uses the stdio protocol, not HTTP endpoints
  • Requires MCP SDK (mcp>=1.13.0) to be installed

🧪 Testing

Generate Sample Data

  1. Run sample Spark application:

    python3 test-files/sample_spark_app.py
    

    Note: The sample app is configured to write event logs to /tmp/spark-events/ which matches the History Server configuration.

  2. Start History Server:

    ./start_history_server.sh
    
  3. Verify data is available:

    curl "http://localhost:18080/api/v1/applications?limit=3"
    

💡 For Your Own Spark Applications: To make your Spark applications visible in the History Server, ensure they write event logs to the same directory:

# Using spark-submit
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events \
  your_app.py

# Using environment variable
export SPARK_EVENTLOG_DIR=/tmp/spark-events
spark-submit --conf spark.eventLog.enabled=true your_app.py

# In PySpark code
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .config("spark.eventLog.enabled", "true") \
  .config("spark.eventLog.dir", "/tmp/spark-events") \
  .getOrCreate()

Test MCP Server

# Test version
./env/bin/python3 src/main.py --version

# Test MCP protocol (initialize message)
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | \
./env/bin/python3 src/main.py --config config.json

# Test tools listing
echo '{"jsonrpc": "2.0", "id": 2, "method": "tools/list"}' | \
./env/bin/python3 src/main.py --config config.json
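
The MCP Python SDK's stdio client can exercise the server end-to-end as well. A sketch, assuming it is run from the repository root:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a subprocess and speak MCP over its stdin/stdout.
    params = StdioServerParameters(
        command="./env/bin/python3",
        args=["src/main.py", "--config", "config.json"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("tools:", [tool.name for tool in tools.tools])
            result = await session.call_tool("get_applications", {"limit": 3})
            print(result.content)

asyncio.run(main())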

📊 Response Examples

Applications List

{
  "success": true,
  "data": [
    {
      "id": "local-1755324061532",
      "name": "MCP-Test-Sample-Application",
      "attempts": [{
        "startTime": "2025-08-16T06:01:01.024GMT",
        "endTime": "2025-08-16T06:01:45.732GMT",
        "completed": true,
        "sparkUser": "username",
        "appSparkVersion": "3.3.1"
      }]
    }
  ],
  "count": 1,
  "message": "Retrieved 1 applications"
}

Application Jobs

{
  "success": true,
  "data": [
    {
      "jobId": 0,
      "name": "count at NativeMethodAccessorImpl.java:0",
      "status": "SUCCEEDED",
      "numTasks": 8,
      "numCompletedTasks": 8,
      "submissionTime": "2025-08-16T06:01:03.732GMT",
      "completionTime": "2025-08-16T06:01:04.752GMT"
    }
  ],
  "count": 34,
  "message": "Retrieved 34 jobs for application local-1755324061532"
}
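
Because responses are plain JSON, downstream tooling can post-process them directly. For example, summing job wall-clock time from a get_application_jobs response (a sketch based on the payload shape above):

from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # History Server timestamps look like "2025-08-16T06:01:03.732GMT".
    return datetime.strptime(ts.replace("GMT", ""), "%Y-%m-%dT%H:%M:%S.%f")

def total_job_seconds(response: dict) -> float:
    """Sum wall-clock duration across completed jobs in a get_application_jobs response."""
    return sum(
        (parse_ts(job["completionTime"]) - parse_ts(job["submissionTime"])).total_seconds()
        for job in response["data"]
        if "completionTime" in job
    )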

📁 Project Structure

spark-mcp-server/
├── src/
│   ├── main.py              # MCP server entry point (stdio-based)
│   ├── history_client.py    # Spark History Server HTTP client
│   └── mcp_server.py        # Original implementation (unused)
├── config.json              # Server configuration
├── start_history_server.sh  # History Server startup script
├── requirements.txt         # Python dependencies
├── env/                     # Virtual environment (with MCP SDK)
├── test-files/              # Development & test files
└── README.md                # This file

⚙️ Configuration Options

Basic Configuration

{
  "spark_history_server": {
    "url": "http://localhost:18080"
  },
  "logging": {
    "level": "INFO",
    "console": true
  }
}

With Authentication

{
  "spark_history_server": {
    "url": "https://spark-history.company.com:18080",
    "auth": {
      "type": "basic",
      "username": "spark_user",
      "password": "${SPARK_PASSWORD}"
    }
  },
  "logging": {
    "level": "INFO",
    "console": true,
    "file": "/var/log/spark-mcp.log"
  }
}
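
The ${SPARK_PASSWORD} placeholder implies environment-variable substitution when the config is loaded. One way to implement that pattern in Python (a sketch, not necessarily this project's code):

import json
import os
import re

def load_config(path: str) -> dict:
    """Load a JSON config, replacing ${VAR} placeholders with environment values."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Leave a placeholder intact if the variable is unset, so the error surfaces clearly.
    expanded = re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), m.group(0)), raw)
    return json.loads(expanded)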

🎯 Use Cases

  • Performance Analysis: Query job/stage execution times and resource usage
  • Monitoring Integration: Feed Spark metrics into monitoring dashboards
  • Development Tools: IDE integration for Spark application monitoring
  • CI/CD Pipelines: Automated Spark job status checking
  • Data Engineering: Programmatic access to Spark execution metadata

🚧 Current Limitations

  • Limited Tools: Only 4 of the planned 15+ tools are implemented
  • Basic Error Handling: Minimal error handling and retry logic
  • Basic Authentication: Implemented but needs additional testing
  • Single History Server: No support for multiple History Server instances
  • Requires Setup: Client must have correct paths configured

📜 License

Apache License 2.0 - see LICENSE file for details.

🐛 Support

  • Check test-files/ directory for examples and troubleshooting
  • Review logs when running with "level": "DEBUG" in the config (example after this list)
  • Ensure Spark History Server is accessible at the configured URL
  • See MCP_CONFIGURATION_FIX.md for detailed troubleshooting information
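
For example, a debug variant of the Basic Configuration above:

{
  "spark_history_server": {
    "url": "http://localhost:18080"
  },
  "logging": {
    "level": "DEBUG",
    "console": true
  }
}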

Current Status: ✅ Working MCP Server with 4 core tools implemented.