emr-mcp-server

akashdeep01/emr-mcp-server

EMR MCP Server

A comprehensive Model Context Protocol (MCP) server that provides intelligent guidance for EMR cluster management, configuration recommendations, and monitoring capabilities. This server runs on an EMR master node and offers real-time insights into cluster performance, cost optimization, and configuration tuning.

🚀 Features

🏗️ Cluster Management

  • Real-time cluster information with detailed instance group analysis
  • Multi-cluster support with filtering and search capabilities
  • Cost analysis and estimation with breakdown by instance types
  • Instance type recommendations based on workload patterns
  • Auto-scaling policy suggestions for optimal resource utilization

📊 Resource Monitoring

  • YARN ResourceManager integration for application monitoring
  • HDFS NameNode monitoring for storage health and utilization
  • Real-time resource utilization across all cluster nodes
  • Application performance analysis with bottleneck identification
  • Historical trend analysis for capacity planning

🧠 Analytics & Optimization

  • Spark History Server integration for detailed job analysis
  • Configuration recommendations based on workload patterns
  • Performance diagnostics with actionable insights
  • Cost optimization suggestions including spot instance usage
  • Workload-specific tuning for batch, streaming, and ML workloads

🔒 Security & Authentication

  • Multiple authentication methods: API keys, JWT tokens, IAM roles
  • Role-based access control with granular permissions
  • Secure communication with HTTPS and certificate validation
  • Request rate limiting to prevent abuse
  • Audit logging for compliance and monitoring
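As an illustration of the API-key method listed above, a constant-time key check could look like the following sketch. The function name `is_valid_api_key` and the idea of passing the configured keys as a plain list are assumptions made for this example, not the server's confirmed implementation.

```python
import hmac
from typing import Iterable


def is_valid_api_key(presented: str, allowed_keys: Iterable[str]) -> bool:
    """Return True if `presented` equals any configured API key.

    hmac.compare_digest compares in constant time, which avoids leaking
    how many leading characters of a key were guessed correctly.
    """
    presented_bytes = presented.encode("utf-8")
    matched = False
    for key in allowed_keys:
        # Check every key without an early exit so the running time
        # does not depend on which key (if any) matched.
        if hmac.compare_digest(presented_bytes, key.encode("utf-8")):
            matched = True
    return matched
```

A request handler would typically call this with the value of the `X-API-Key` header and reject the request when it returns `False`.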

📋 Quick Start

Prerequisites

  • EMR cluster running version 6.0+
  • Python 3.8+
  • Access to YARN ResourceManager (port 8088)
  • Access to Spark History Server (port 18080)
  • Access to HDFS NameNode (port 9870)

Installation

# Clone the repository
git clone https://github.com/your-org/emr-mcp-server.git
cd emr-mcp-server

# Install dependencies
pip install -r requirements.txt

# Configure the server
cp config/server_config.yaml.example config/server_config.yaml
# Edit the configuration file with your EMR cluster details

Configuration

Edit config/server_config.yaml:

server:
  host: "0.0.0.0"
  port: 3000
  debug: false
  workers: 4

emr:
  region: "us-east-1"
  cluster_id: "j-XXXXXXXXX"  # Optional: specific cluster ID
  
yarn:
  resource_manager_url: "http://localhost:8088"
  timeout: 30
  
spark:
  history_server_url: "http://localhost:18080"
  timeout: 30
  
hdfs:
  namenode_url: "http://localhost:9870"
  timeout: 30

auth:
  method: "api_key"  # Options: api_key, jwt, iam
  api_keys:
    - "emr-mcp-default-key"
  jwt_secret: "your-jwt-secret"
  
logging:
  level: "INFO"
  format: "console"  # Options: console, json

Running the Server

# Start the server directly
python -m src.server

# Or use the startup script
./scripts/start_server.sh

# Check server status
curl http://localhost:3000/health
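When the server is started from a script, it can be useful to wait until the health endpoint responds before issuing tool calls. The sketch below polls the `/health` URL shown above. It assumes the endpoint answers with HTTP 200 when the server is ready, and the helper names `http_status` and `wait_until_healthy` are made up for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], int] = http_status,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:3000/health")` returns `True` as soon as one probe succeeds and `False` after ten failed probes.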

🛠️ MCP Tools

Cluster Management Tools

get_cluster_info

Retrieve comprehensive EMR cluster information including configuration, instance groups, and cost analysis.

{
  "name": "get_cluster_info",
  "arguments": {
    "cluster_id": "j-XXXXXXXXX"  // Optional
  }
}

list_clusters

List all EMR clusters with optional state filtering.

{
  "name": "list_clusters",
  "arguments": {
    "states": ["RUNNING", "WAITING"]  // Optional
  }
}

estimate_cost

Calculate current and projected costs with detailed breakdown.

{
  "name": "estimate_cost",
  "arguments": {
    "runtime_hours": 48.0,  // Optional
    "cluster_id": "j-XXXXXXXXX"  // Optional
  }
}

suggest_instance_types

Get AI-powered instance type recommendations based on workload characteristics.

{
  "name": "suggest_instance_types",
  "arguments": {
    "workload_type": "memory_intensive",  // Options: general, compute_intensive, memory_intensive, storage_intensive
    "data_size_gb": 1000,  // Optional
    "concurrent_jobs": 10  // Optional
  }
}
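The arithmetic behind a projection such as the one `estimate_cost` returns can be sketched as instance count times hourly rate times runtime, summed per instance type. The hourly rates below are hypothetical placeholders, not AWS prices, and the sketch treats each rate as a single combined figure per instance. How the server actually computes costs is not documented here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class InstanceGroup:
    role: str
    instance_type: str
    count: int
    hourly_rate_usd: float  # hypothetical rate, for illustration only


def estimate_cost(
    groups: List[InstanceGroup], runtime_hours: float
) -> Tuple[Dict[str, float], float]:
    """Return (cost per instance type, total cost) for the given runtime."""
    breakdown: Dict[str, float] = {}
    for group in groups:
        cost = group.count * group.hourly_rate_usd * runtime_hours
        breakdown[group.instance_type] = breakdown.get(group.instance_type, 0.0) + cost
    return breakdown, sum(breakdown.values())


groups = [
    InstanceGroup("MASTER", "m5.xlarge", 1, 0.25),
    InstanceGroup("CORE", "m5.2xlarge", 3, 0.5),
    InstanceGroup("TASK", "m5.large", 2, 0.125),
]
breakdown, total = estimate_cost(groups, runtime_hours=48.0)
print(breakdown, total)
```

With these placeholder rates, 48 hours comes to 12.0 for the master node, 72.0 for the core group and 12.0 for the task group, for a total of 96.0.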

Monitoring Tools

monitor_resources

Get real-time resource utilization across YARN, HDFS, and cluster nodes.

{
  "name": "monitor_resources",
  "arguments": {}
}

analyze_yarn_applications

Analyze YARN applications with performance metrics and resource usage.

{
  "name": "analyze_yarn_applications",
  "arguments": {
    "states": ["RUNNING", "FINISHED"],  // Optional
    "application_types": ["SPARK"],  // Optional
    "limit": 50  // Optional, default: 50
  }
}

diagnose_performance

Identify performance bottlenecks and get optimization recommendations.

{
  "name": "diagnose_performance",
  "arguments": {
    "app_id": "application_1234567890_0001",  // Optional
    "time_range_hours": 24  // Optional, default: 24
  }
}
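The arguments of `analyze_yarn_applications` line up with the query parameters of the YARN ResourceManager REST endpoint `/ws/v1/cluster/apps` (`states`, `applicationTypes`, `limit`). The sketch below builds such a URL. Whether the server's YARN connector issues exactly this request is an assumption, and `yarn_apps_url` is a name chosen for this example.

```python
from typing import Dict, Optional, Sequence
from urllib.parse import urlencode


def yarn_apps_url(
    base_url: str,
    states: Optional[Sequence[str]] = None,
    application_types: Optional[Sequence[str]] = None,
    limit: int = 50,
) -> str:
    """Build a ResourceManager REST URL that lists applications."""
    params: Dict[str, str] = {}
    if states:
        params["states"] = ",".join(states)
    if application_types:
        params["applicationTypes"] = ",".join(application_types)
    params["limit"] = str(limit)
    # safe="," keeps the comma-separated lists readable in the query string.
    query = urlencode(params, safe=",")
    return f"{base_url.rstrip('/')}/ws/v1/cluster/apps?{query}"


url = yarn_apps_url("http://localhost:8088", ["RUNNING", "FINISHED"], ["SPARK"])
print(url)
```

The resulting URL can be fetched with any HTTP client. The ResourceManager returns JSON by default.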

Analytics Tools

get_spark_logs

Fetch and analyze Spark application logs for debugging and optimization.

{
  "name": "get_spark_logs",
  "arguments": {
    "app_id": "application_1234567890_0001",  // Required
    "executor_id": "1"  // Optional
  }
}

recommend_configuration

Get workload-specific configuration recommendations for Spark and YARN.

{
  "name": "recommend_configuration",
  "arguments": {
    "workload_type": "batch",  // Options: batch, streaming, ml, interactive
    "app_id": "application_1234567890_0001"  // Optional
  }
}
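The `//` annotations in the examples above are explanatory only, since strict JSON does not allow comments. A small helper can build valid request bodies, drop optional arguments that were left unset, and check `workload_type` against the options listed above. This is a sketch: `build_tool_call` is a hypothetical helper, and treating `workload_type` as required for the two tools that list options is an assumption.

```python
import json
from typing import Any, Dict, Optional

# Allowed values copied from the tool descriptions above.
WORKLOAD_TYPES = {
    "suggest_instance_types": {
        "general", "compute_intensive", "memory_intensive", "storage_intensive",
    },
    "recommend_configuration": {"batch", "streaming", "ml", "interactive"},
}


def build_tool_call(name: str, arguments: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a tool call as JSON, omitting arguments set to None."""
    args = {k: v for k, v in (arguments or {}).items() if v is not None}
    allowed = WORKLOAD_TYPES.get(name)
    if allowed is not None and args.get("workload_type") not in allowed:
        raise ValueError(
            f"{name}: workload_type must be one of {sorted(allowed)}"
        )
    return json.dumps({"name": name, "arguments": args})


payload = build_tool_call("estimate_cost", {"runtime_hours": 48.0, "cluster_id": None})
print(payload)
```

The returned string can be sent as the body of a tool call request. Unset optional arguments are simply omitted rather than sent as `null`.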

🚀 Deployment Options

1. EMR Bootstrap Script (Recommended)

Deploy automatically when creating an EMR cluster:

# Upload bootstrap script to S3
aws s3 cp scripts/bootstrap-emr-mcp.sh s3://your-bucket/

# Create EMR cluster with MCP server
aws emr create-cluster \
  --name "EMR-MCP-Cluster" \
  --release-label emr-6.4.0 \
  --applications Name=Spark Name=Hadoop Name=Hive Name=Zeppelin \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m5.2xlarge,InstanceCount=3 \
    InstanceGroupType=TASK,InstanceType=m5.large,InstanceCount=2,BidPrice=0.05 \
  --bootstrap-actions Path=s3://your-bucket/bootstrap-emr-mcp.sh \
  --ec2-attributes KeyName=your-key-pair \
  --log-uri s3://your-bucket/emr-logs/

2. Docker Deployment

# Build the image
docker build -t emr-mcp-server .

# Run with docker-compose
docker-compose up -d

# Check logs
docker-compose logs -f emr-mcp-server

3. Systemd Service

# Copy service file
sudo cp scripts/emr-mcp-server.service /etc/systemd/system/

# Enable and start
sudo systemctl enable emr-mcp-server
sudo systemctl start emr-mcp-server
sudo systemctl status emr-mcp-server

💻 Usage Examples

Python Client

import asyncio
from examples.client_example import EMRMCPClient

async def main():
    async with EMRMCPClient("http://localhost:3000", "emr-mcp-default-key") as client:
        # Get cluster information
        cluster_info = await client.call_tool("get_cluster_info")
        print("Cluster Info:", cluster_info["content"][0]["text"])
        
        # Monitor resources
        resources = await client.call_tool("monitor_resources")
        print("Resources:", resources["content"][0]["text"])
        
        # Get configuration recommendations
        config_rec = await client.call_tool("recommend_configuration", {
            "workload_type": "batch"
        })
        print("Config Recommendations:", config_rec["content"][0]["text"])

asyncio.run(main())

cURL Examples

# Health check
curl http://localhost:3000/health

# List available tools
curl -X GET http://localhost:3000/tools \
  -H "X-API-Key: emr-mcp-default-key"

# Get cluster information
curl -X POST http://localhost:3000/tools/call \
  -H "Content-Type: application/json" \
  -H "X-API-Key: emr-mcp-default-key" \
  -d '{
    "name": "get_cluster_info",
    "arguments": {}
  }'

# Monitor resources
curl -X POST http://localhost:3000/tools/call \
  -H "Content-Type: application/json" \
  -H "X-API-Key: emr-mcp-default-key" \
  -d '{
    "name": "monitor_resources",
    "arguments": {}
  }'
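The same `POST /tools/call` request can be made from Python using only the standard library. The sketch below mirrors the headers and body of the cURL commands above. The helper names are invented for this example, and it assumes the server replies with a JSON document, as the Python client example suggests.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def make_tool_request(
    base_url: str,
    api_key: str,
    name: str,
    arguments: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /tools/call endpoint."""
    body = json.dumps({"name": name, "arguments": arguments or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tools/call",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def call_tool(
    base_url: str,
    api_key: str,
    name: str,
    arguments: Optional[Dict[str, Any]] = None,
    timeout: float = 30.0,
) -> Any:
    """Send the request and decode the response body as JSON."""
    request = make_tool_request(base_url, api_key, name, arguments)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `call_tool("http://localhost:3000", "emr-mcp-default-key", "monitor_resources")` corresponds to the last cURL command above.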

🧪 Development

Running Tests

# Install development dependencies
pip install -r requirements.txt

# Run all tests
pytest

# Run specific test file
pytest tests/test_cluster.py -v

# Run with coverage
pytest --cov=src tests/ --cov-report=html

# Run demo with mock data
python demo.py

# Test server creation
python test_server.py

Code Quality

# Format code
black src/ tests/ examples/

# Sort imports
isort src/ tests/ examples/

# Type checking
mypy src/

# Linting
flake8 src/ tests/ examples/

🏗️ Architecture

emr-mcp-server/
├── src/
│   ├── server.py              # Main MCP server implementation
│   ├── tools/                 # MCP tool implementations
│   │   ├── cluster.py         # Cluster management tools
│   │   ├── monitoring.py      # Resource monitoring tools
│   │   └── analytics.py       # Analytics and optimization tools
│   ├── connectors/            # Service connectors
│   │   ├── emr.py             # EMR API connector
│   │   ├── yarn.py            # YARN ResourceManager connector
│   │   ├── spark.py           # Spark History Server connector
│   │   └── hdfs.py            # HDFS NameNode connector
│   └── utils/                 # Utilities
│       ├── config.py          # Configuration management
│       └── auth.py            # Authentication utilities
├── config/
│   └── server_config.yaml     # Server configuration
├── tests/                     # Comprehensive test suite
├── examples/                  # Usage examples
├── scripts/                   # Deployment scripts
├── Dockerfile                 # Docker configuration
├── docker-compose.yml         # Docker Compose setup
├── demo.py                    # Demo with mock data
└── test_server.py             # Server creation test

📊 Key Features Demonstrated

✅ Completed Implementation

  1. 🏗️ Complete Project Structure

    • Organized codebase with clear separation of concerns
    • Proper Python package structure with imports
    • Configuration management with YAML and environment variables
  2. 🔧 MCP Server Implementation

    • Full MCP protocol compliance with tool registration
    • Async/await architecture for high performance
    • Structured logging with configurable formats
    • Graceful shutdown with proper cleanup
  3. 🔌 Service Connectors

    • EMR API integration for cluster management
    • YARN ResourceManager connector for application monitoring
    • Spark History Server connector for job analysis
    • HDFS NameNode connector for storage monitoring
    • Connection pooling and retry logic
  4. 🛠️ MCP Tools

    • Cluster Management: get_cluster_info, list_clusters, estimate_cost, suggest_instance_types
    • Monitoring: monitor_resources, analyze_yarn_applications, diagnose_performance
    • Analytics: get_spark_logs, recommend_configuration
    • All tools return structured markdown with actionable insights
  5. 🔒 Security & Authentication

    • Multi-method authentication (API keys, JWT, IAM roles)
    • Input validation and sanitization
    • Secure configuration management
  6. 🚀 Deployment Ready

    • Docker containerization with multi-stage builds
    • EMR bootstrap script for automatic deployment
    • Systemd service configuration
    • Docker Compose for development
  7. 🧪 Testing & Quality

    • Comprehensive test suite with mocking
    • Demo script with realistic mock data
    • Code quality tools (black, isort, mypy, flake8)
    • Type hints throughout codebase
  8. 📚 Documentation & Examples

    • Detailed README with usage examples
    • Python client example with async patterns
    • cURL examples for API testing
    • Configuration examples and deployment guides

🎯 Demo Results

Running the demo (python demo.py) prints:

🎯 EMR MCP Server Demo
================================================================================
🚀 EMR Cluster Management Demo
📋 Getting Cluster Information...
💰 Cost Estimation...
🖥️  Instance Type Suggestions...

📊 Resource Monitoring Demo
📈 Resource Monitoring...
🔍 YARN Applications Analysis...

🧠 Analytics & Configuration Demo
⚙️  Configuration Recommendations for Batch Workload...
🤖 Configuration Recommendations for ML Workload...

✅ Demo completed successfully!

🔧 Production Ready Features

  • Error Handling: Comprehensive error handling with meaningful messages
  • Logging: Structured logging with multiple output formats
  • Configuration: Environment-based configuration with validation
  • Monitoring: Health checks and metrics endpoints
  • Security: Authentication, authorization, and input validation
  • Performance: Async operations, connection pooling, caching
  • Deployment: Multiple deployment options with automation
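The retry logic mentioned for the service connectors could follow a standard exponential backoff pattern. The sketch below shows one asynchronous variant. The function name, the default attempt count and the choice of exceptions to retry on are assumptions for illustration, not the project's actual code.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await `operation`, retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await operation()
        except (ConnectionError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A connector would wrap each outgoing request, for example `await retry_async(lambda: fetch_yarn_metrics())`, where `fetch_yarn_metrics` stands in for any coroutine that talks to a cluster service.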

🤝 Contributing

We welcome contributions! Please see our development workflow:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Run the test suite and quality checks
  5. Submit a pull request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

  • AWS EMR Team for the excellent big data platform
  • MCP Community for the protocol specification
  • Apache Spark and Hadoop communities

Made with ❤️ for the EMR community

Ready for production deployment on EMR clusters!