mcp-apache-spark-history-server

The MCP Server for Apache Spark History Server connects AI agents to the Spark History Server for intelligent job analysis and performance monitoring.

MCP Server for Apache Spark History Server

🤖 Connect AI agents to Apache Spark History Server for intelligent job analysis and performance monitoring

Transform your Spark infrastructure monitoring with AI! This Model Context Protocol (MCP) server enables AI agents to analyze job performance, identify bottlenecks, and provide intelligent insights from your Spark History Server data.

🎯 What is This?

Spark History Server MCP bridges AI agents with your existing Apache Spark infrastructure, enabling:

  • 🔍 Query job details through natural language
  • 📊 Analyze performance metrics across applications
  • 🔄 Compare multiple jobs to identify regressions
  • 🚨 Investigate failures with detailed error analysis
  • 📈 Generate insights from historical execution data

📺 See it in action:

Watch the demo video

🏗️ Architecture

graph TB
    A[🤖 AI Agent/LLM] --> F[📡 MCP Client]
    B[🦙 LlamaIndex Agent] --> F
    C[🌐 LangGraph] --> F
    D[🖥️ Claude Desktop] --> F
    E[🛠️ Amazon Q CLI] --> F

    F --> G[⚡ Spark History MCP Server]

    G --> H[🔥 Prod Spark History Server]
    G --> I[🔥 Staging Spark History Server]
    G --> J[🔥 Dev Spark History Server]

    H --> K[📄 Prod Event Logs]
    I --> L[📄 Staging Event Logs]
    J --> M[📄 Dev Event Logs]

🔗 Components:

  • 🔥 Spark History Server: Your existing infrastructure serving Spark event data
  • ⚡ MCP Server: This project - provides MCP tools for querying Spark data
  • 🤖 AI Agents: LangChain, custom agents, or any MCP-compatible client

⚡ Quick Start

📋 Prerequisites

  • 🔥 Existing Spark History Server (running and accessible)
  • 🐍 Python 3.12+
  • ⚡ uv package manager

🚀 Setup & Testing

git clone https://github.com/DeepDiagnostix-AI/mcp-apache-spark-history-server.git
cd mcp-apache-spark-history-server

# Install Task (if not already installed)
brew install go-task  # macOS, see https://taskfile.dev/installation/ for others

# Setup and start testing
task start-spark-bg            # Start Spark History Server with sample data (default Spark 3.5.5)
# Or specify a different Spark version:
# task start-spark-bg spark_version=3.5.2
task start-mcp-bg             # Start MCP Server

# Optional: Opens MCP Inspector on http://localhost:6274 for interactive testing
# Requires Node.js: 22.7.5+ (Check https://github.com/modelcontextprotocol/inspector for latest requirements)
task start-inspector-bg       # Start MCP Inspector

# When done, run `task stop-all`

If you just want to run the MCP server without cloning the repository:

# Run with uv without installing the module
uvx --from mcp-apache-spark-history-server spark-mcp

# OR run with pip and python. Use of venv is highly encouraged.
python3 -m venv spark-mcp && source spark-mcp/bin/activate
pip install mcp-apache-spark-history-server
python3 -m spark_history_mcp.core.main
# Deactivate venv
deactivate

📊 Sample Data

The repository includes real Spark event logs for testing:

  • spark-bcec39f6201b42b9925124595baad260 - ✅ Successful ETL job
  • spark-110be3a8424d4a2789cb88134418217b - 🔄 Data processing job
  • spark-cc4d115f011443d787f03a71a476a745 - 📈 Multi-stage analytics job

See the repository documentation for instructions on using them.

⚙️ Server Configuration

Edit config.yaml for your Spark History Server:

servers:
  local:
    default: true
    url: "http://your-spark-history-server:18080"
    auth:  # optional
      username: "user"
      password: "pass"
mcp:
  transports:
    - streamable-http # streamable-http or stdio.
  port: "18888"
  debug: true
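
Before pointing the MCP server at a history server, it can help to confirm that the configured URL answers REST calls. The sketch below is not part of this project. It uses only the Python standard library and assumes the standard Spark History Server REST endpoint `/api/v1/applications`; it also assumes the optional `auth` block corresponds to HTTP Basic authentication. The helper names `build_applications_request` and `list_applications` are hypothetical.

```python
import base64
import json
import urllib.request


def build_applications_request(base_url, username=None, password=None, limit=5):
    """Build a GET request for the Spark History Server application list."""
    url = f"{base_url.rstrip('/')}/api/v1/applications?limit={limit}"
    request = urllib.request.Request(url)
    if username is not None and password is not None:
        # Assumption: the `auth` block in config.yaml maps to HTTP Basic auth.
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        request.add_header("Authorization", f"Basic {token}")
    return request


def list_applications(base_url, username=None, password=None, timeout=30):
    """Fetch and parse the application list (performs a network call)."""
    request = build_applications_request(base_url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `list_applications("http://your-spark-history-server:18080")` should return a list of application objects if the server is reachable; a connection error at this step usually means the `url` in config.yaml needs attention before the MCP server will work.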

📸 Screenshots

🔍 Get Spark Application

[Screenshot: Get Application]

⚡ Job Performance Comparison

[Screenshot: Job Comparison]

🛠️ Available Tools

Note: These tools are subject to change as we scale and improve the performance of the MCP server.

The MCP server provides 17 specialized tools organized by analysis patterns. LLMs can intelligently select and combine these tools based on user queries:

📊 Application Information

Basic application metadata and overview

| 🔧 Tool | 📝 Description |
|---|---|
| get_application | 📊 Get detailed information about a specific Spark application including status, resource usage, duration, and attempt details |

🔗 Job Analysis

Job-level performance analysis and identification

| 🔧 Tool | 📝 Description |
|---|---|
| list_jobs | 🔗 Get a list of all jobs for a Spark application with optional status filtering |
| list_slowest_jobs | ⏱️ Get the N slowest jobs for a Spark application (excludes running jobs by default) |

⚡ Stage Analysis

Stage-level performance deep dive and task metrics

| 🔧 Tool | 📝 Description |
|---|---|
| list_stages | ⚡ Get a list of all stages for a Spark application with optional status filtering and summaries |
| list_slowest_stages | 🐌 Get the N slowest stages for a Spark application (excludes running stages by default) |
| get_stage | 🎯 Get information about a specific stage with optional attempt ID and summary metrics |
| get_stage_task_summary | 📊 Get statistical distributions of task metrics for a specific stage (execution times, memory usage, I/O metrics) |

🖥️ Executor & Resource Analysis

Resource utilization, executor performance, and allocation tracking

| 🔧 Tool | 📝 Description |
|---|---|
| list_executors | 🖥️ Get executor information with optional inactive executor inclusion |
| get_executor | 🔍 Get information about a specific executor including resource allocation, task statistics, and performance metrics |
| get_executor_summary | 📈 Aggregate metrics across all executors (memory usage, disk usage, task counts, performance metrics) |
| get_resource_usage_timeline | 📅 Get chronological view of resource allocation and usage patterns including executor additions/removals |

⚙️ Configuration & Environment

Spark configuration, environment variables, and runtime settings

| 🔧 Tool | 📝 Description |
|---|---|
| get_environment | ⚙️ Get comprehensive Spark runtime configuration including JVM info, Spark properties, system properties, and classpath |

🔎 SQL & Query Analysis

SQL performance analysis and execution plan comparison

| 🔧 Tool | 📝 Description |
|---|---|
| list_slowest_sql_queries | 🐌 Get the top N slowest SQL queries for an application with detailed execution metrics |
| compare_sql_execution_plans | 🔍 Compare SQL execution plans between two Spark jobs, analyzing logical/physical plans and execution metrics |

🚨 Performance & Bottleneck Analysis

Intelligent bottleneck identification and performance recommendations

| 🔧 Tool | 📝 Description |
|---|---|
| get_job_bottlenecks | 🚨 Identify performance bottlenecks by analyzing stages, tasks, and executors with actionable recommendations |

🔄 Comparative Analysis

Cross-application comparison for regression detection and optimization

| 🔧 Tool | 📝 Description |
|---|---|
| compare_job_environments | ⚙️ Compare Spark environment configurations between two jobs to identify differences in properties and settings |
| compare_job_performance | 📈 Compare performance metrics between two Spark jobs including execution times, resource usage, and task distribution |
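
To make the "N slowest" tools more concrete, the sketch below ranks jobs by wall-clock duration and leaves out jobs that are still running, which is the behaviour described for list_slowest_jobs. It is an illustration under stated assumptions, not the server's implementation: the `submission_ms` and `completion_ms` fields are simplified epoch-millisecond stand-ins for the timestamps the Spark REST API returns, and the sample data is made up.

```python
from typing import Dict, List


def slowest_jobs(jobs: List[Dict], n: int = 5) -> List[Dict]:
    """Return the n longest completed jobs, longest first."""
    finished = []
    for job in jobs:
        # Running jobs have no completion time yet, so they cannot be ranked.
        if job["status"] == "RUNNING" or job.get("completion_ms") is None:
            continue
        duration = job["completion_ms"] - job["submission_ms"]
        finished.append({**job, "duration_ms": duration})
    return sorted(finished, key=lambda j: j["duration_ms"], reverse=True)[:n]


# Hypothetical sample data for illustration only.
SAMPLE_JOBS = [
    {"jobId": 0, "status": "SUCCEEDED", "submission_ms": 1000, "completion_ms": 4000},
    {"jobId": 1, "status": "SUCCEEDED", "submission_ms": 2000, "completion_ms": 12000},
    {"jobId": 2, "status": "RUNNING", "submission_ms": 5000, "completion_ms": None},
    {"jobId": 3, "status": "FAILED", "submission_ms": 6000, "completion_ms": 7500},
]
```

With the sample above, `slowest_jobs(SAMPLE_JOBS, n=2)` returns jobs 1 and 0 in that order; job 2 is skipped because it is still running.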

🤖 How LLMs Use These Tools

Query Pattern Examples:

  • "Why is my job slow?" → get_job_bottlenecks + list_slowest_stages + get_executor_summary
  • "Compare today vs yesterday" → compare_job_performance + compare_job_environments
  • "What's wrong with stage 5?" → get_stage + get_stage_task_summary
  • "Show me resource usage over time" → get_resource_usage_timeline + get_executor_summary
  • "Find my slowest SQL queries" → list_slowest_sql_queries + compare_sql_execution_plans
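
The pairings above can be written down as a lookup table. In practice the LLM chooses tools from their descriptions; the keyword matcher below is only a toy stand-in for that step. `QUERY_PATTERNS` and `suggest_tools` are hypothetical names and are not part of the server.

```python
# Keyword -> tool combination, taken from the query patterns listed above.
QUERY_PATTERNS = {
    "slow": ["get_job_bottlenecks", "list_slowest_stages", "get_executor_summary"],
    "compare": ["compare_job_performance", "compare_job_environments"],
    "stage": ["get_stage", "get_stage_task_summary"],
    "resource": ["get_resource_usage_timeline", "get_executor_summary"],
    "sql": ["list_slowest_sql_queries", "compare_sql_execution_plans"],
}


def suggest_tools(query):
    """Return the tools whose keyword appears in the query, without duplicates."""
    text = query.lower()
    tools = []
    for keyword, names in QUERY_PATTERNS.items():
        if keyword not in text:
            continue
        for name in names:
            if name not in tools:
                tools.append(name)
    return tools
```

A query can match several keywords, in which case the tool lists are merged in table order; a real agent would weigh the full tool descriptions rather than single words.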

📔 AWS Integration Guides

If you are an existing AWS user looking to analyze your Spark applications, we provide detailed setup guides for:

  • AWS Glue - Connect to Glue Spark History Server
  • Amazon EMR - Use EMR Persistent UI for Spark analysis

These guides provide step-by-step instructions for setting up the Spark History Server MCP with your AWS services.

🚀 Kubernetes Deployment

Deploy using Kubernetes with Helm:

⚠️ Work in Progress: We are still testing and will soon publish the container image and Helm registry to GitHub for easy deployment.

# 📦 Deploy with Helm
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/

# 🎯 Production configuration
helm install spark-history-mcp ./deploy/kubernetes/helm/spark-history-mcp/ \
  --set replicaCount=3 \
  --set autoscaling.enabled=true \
  --set monitoring.enabled=true

📚 See deploy/kubernetes/helm/ for complete deployment manifests and configuration options.

🌐 Multi-Spark History Server Setup

Set up multiple Spark History Servers in config.yaml and choose which server the LLM should use for each query.

servers:
  production:
    default: true
    url: "http://prod-spark-history:18080"
    auth:
      username: "user"
      password: "pass"
  staging:
    url: "http://staging-spark-history:18080"
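
One way to picture the selection logic: a tool call may name a server, and otherwise the entry marked `default: true` is used. The sketch below mirrors the YAML above as a Python dict. `resolve_server` is a hypothetical helper, and the fallback rule is an assumption about how the `default` flag behaves rather than a description of the server's code.

```python
# Same structure as the config.yaml example above (auth omitted for brevity).
SERVERS = {
    "production": {"default": True, "url": "http://prod-spark-history:18080"},
    "staging": {"url": "http://staging-spark-history:18080"},
}


def resolve_server(servers, name=None):
    """Pick the named server, or fall back to the one flagged as default."""
    if name is not None:
        if name not in servers:
            raise KeyError(f"unknown server: {name!r}")
        return servers[name]
    for config in servers.values():
        if config.get("default"):
            return config
    raise ValueError("no server requested and no default configured")
```

Under these assumptions, `resolve_server(SERVERS)` yields the production entry and `resolve_server(SERVERS, "staging")` yields the staging one.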

💁 User Query: "Can you get application <app_id> using production server?"

🤖 AI Tool Request:

{
  "app_id": "<app_id>",
  "server": "production"
}

🤖 AI Tool Response:

{
  "id": "<app_id>",
  "name": "app_name",
  "coresGranted": null,
  "maxCores": null,
  "coresPerExecutor": null,
  "memoryPerExecutorMB": null,
  "attempts": [
    {
      "attemptId": null,
      "startTime": "2023-09-06T04:44:37.006000Z",
      "endTime": "2023-09-06T04:45:40.431000Z",
      "lastUpdated": "2023-09-06T04:45:42Z",
      "duration": 63425,
      "sparkUser": "spark",
      "appSparkVersion": "3.3.0",
      "completed": true
    }
  ]
}
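
On the wire, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below wraps the arguments from the request above in that envelope, then sanity-checks the sample response: the `duration` field should equal `endTime` minus `startTime` in milliseconds. The tool name `get_application` comes from the tool list; the request id and both helper names are illustrative assumptions.

```python
from datetime import datetime, timedelta


def build_tool_call(tool_name, arguments, request_id=1):
    """Wrap tool arguments in an MCP `tools/call` JSON-RPC 2.0 request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def attempt_duration_ms(attempt):
    """Recompute an attempt's duration in milliseconds from its timestamps."""
    def parse(timestamp):
        # Swap the trailing "Z" for an explicit offset so that
        # datetime.fromisoformat also accepts it on older Python versions.
        return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

    elapsed = parse(attempt["endTime"]) - parse(attempt["startTime"])
    return elapsed // timedelta(milliseconds=1)
```

For the sample response, the attempt runs from 04:44:37.006 to 04:45:40.431, which is 63.425 seconds and matches the reported `duration` of 63425.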

🔐 Environment Variables

SHS_MCP_PORT - Port for MCP server (default: 18888)
SHS_MCP_DEBUG - Enable debug mode (default: false)
SHS_MCP_ADDRESS - Address for MCP server (default: localhost)
SHS_MCP_TRANSPORT - MCP transport mode (default: streamable-http)
SHS_SERVERS_*_URL - URL for a specific server
SHS_SERVERS_*_AUTH_USERNAME - Username for a specific server
SHS_SERVERS_*_AUTH_PASSWORD - Password for a specific server
SHS_SERVERS_*_AUTH_TOKEN - Token for a specific server
SHS_SERVERS_*_VERIFY_SSL - Whether to verify SSL for a specific server (true/false)
SHS_SERVERS_*_TIMEOUT - HTTP request timeout in seconds for a specific server (default: 30)
SHS_SERVERS_*_EMR_CLUSTER_ARN - EMR cluster ARN for a specific server
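
As a rough illustration of how these variables fit together, the sketch below reads them into a nested settings dict using the defaults listed above. It is not the server's own loader: `read_settings` is a hypothetical helper, and lowercasing the server name taken from the `*` position is an assumption.

```python
import os
import re

# Matches SHS_SERVERS_<NAME>_<FIELD> for the per-server fields listed above.
_SERVER_VAR = re.compile(
    r"^SHS_SERVERS_(?P<name>.+)_(?P<field>URL|AUTH_USERNAME|AUTH_PASSWORD"
    r"|AUTH_TOKEN|VERIFY_SSL|TIMEOUT|EMR_CLUSTER_ARN)$"
)


def read_settings(environ=None):
    """Collect SHS_* variables into a dict, applying the documented defaults."""
    env = os.environ if environ is None else environ
    settings = {
        "port": int(env.get("SHS_MCP_PORT", "18888")),
        "debug": env.get("SHS_MCP_DEBUG", "false").lower() == "true",
        "address": env.get("SHS_MCP_ADDRESS", "localhost"),
        "transport": env.get("SHS_MCP_TRANSPORT", "streamable-http"),
        "servers": {},
    }
    for key, value in env.items():
        match = _SERVER_VAR.match(key)
        if not match:
            continue
        # Assumption: server names are compared case-insensitively.
        server = settings["servers"].setdefault(match["name"].lower(), {})
        field = match["field"].lower()
        if field == "timeout":
            value = int(value)
        elif field == "verify_ssl":
            value = value.lower() == "true"
        server[field] = value
    return settings
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test, for example `read_settings({"SHS_SERVERS_PROD_URL": "http://prod:18080"})`.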

🤖 AI Agent Integration

Quick Start Options

| Integration | Transport | Best For |
|---|---|---|
|  | HTTP | Development, testing tools |
|  | STDIO | Interactive analysis |
|  | STDIO | Command-line automation |
|  | HTTP | IDE integration, code-centric analysis |
|  | HTTP | Multi-agent workflows |
|  | HTTP | Multi-agent workflows |

🎯 Example Use Cases

🔍 Performance Investigation

🤖 AI Query: "Why is my ETL job running slower than usual?"

📊 MCP Actions:
✅ Analyze application metrics
✅ Compare with historical performance
✅ Identify bottleneck stages
✅ Generate optimization recommendations

🚨 Failure Analysis

🤖 AI Query: "What caused job 42 to fail?"

🔍 MCP Actions:
✅ Examine failed tasks and error messages
✅ Review executor logs and resource usage
✅ Identify root cause and suggest fixes

📈 Comparative Analysis

🤖 AI Query: "Compare today's batch job with yesterday's run"

📊 MCP Actions:
✅ Compare execution times and resource usage
✅ Identify performance deltas
✅ Highlight configuration differences

🤝 Contributing

See the repository's contributing guide for full contribution guidelines.

📄 License

Apache License 2.0 - see the license file in the repository for details.

📝 Trademark Notice

This project is built for use with Apache Spark™ History Server. Not affiliated with or endorsed by the Apache Software Foundation.


🔥 Connect your Spark infrastructure to AI agents

🚀 Get Started | 🛠️ View Tools | 🤝 Contribute

Built by the community, for the community 💙