akashdeep01/emr-mcp-server
EMR MCP Server
A comprehensive Model Context Protocol (MCP) server that provides intelligent guidance for EMR cluster management, configuration recommendations, and monitoring capabilities. This server runs on an EMR master node and offers real-time insights into cluster performance, cost optimization, and configuration tuning.
Features
Cluster Management
- Real-time cluster information with detailed instance group analysis
- Multi-cluster support with filtering and search capabilities
- Cost analysis and estimation with breakdown by instance types
- Instance type recommendations based on workload patterns
- Auto-scaling policy suggestions for optimal resource utilization
Resource Monitoring
- YARN ResourceManager integration for application monitoring
- HDFS NameNode monitoring for storage health and utilization
- Real-time resource utilization across all cluster nodes
- Application performance analysis with bottleneck identification
- Historical trend analysis for capacity planning
Analytics & Optimization
- Spark History Server integration for detailed job analysis
- Configuration recommendations based on workload patterns
- Performance diagnostics with actionable insights
- Cost optimization suggestions including spot instance usage
- Workload-specific tuning for batch, streaming, and ML workloads
Security & Authentication
- Multiple authentication methods: API keys, JWT tokens, IAM roles
- Role-based access control with granular permissions
- Secure communication with HTTPS and certificate validation
- Request rate limiting to prevent abuse
- Audit logging for compliance and monitoring
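As an illustration of the API-key method listed above, here is a minimal sketch of a key check. It is not taken from the server's source: the function name `is_valid_api_key` is hypothetical, and the only assumption is that the server holds a list of accepted keys.

```python
import hmac
from typing import Iterable


def is_valid_api_key(presented: str, allowed_keys: Iterable[str]) -> bool:
    """Return True if `presented` equals one of the configured keys.

    hmac.compare_digest compares in time that does not depend on where
    the two values differ, which avoids leaking key prefixes via timing.
    """
    presented_bytes = presented.encode("utf-8")
    matched = False
    for key in allowed_keys:
        # Check every key (no early exit) so timing does not reveal which one matched.
        if hmac.compare_digest(presented_bytes, key.encode("utf-8")):
            matched = True
    return matched
```

A plain `==` comparison would also work functionally; the constant-time comparison is a common hardening choice for secrets.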
Quick Start
Prerequisites
- EMR cluster running version 6.0+
- Python 3.8+
- Access to YARN ResourceManager (port 8088)
- Access to Spark History Server (port 18080)
- Access to HDFS NameNode (port 9870)
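Before installing, it can help to confirm that the three service ports above are reachable from the node where the server will run. The following is a small sketch using only the standard library; the helper names `port_is_open` and `check_services` are illustrative and not part of this project.

```python
import socket

# Ports taken from the prerequisites list above.
REQUIRED_SERVICES = {
    "YARN ResourceManager": 8088,
    "Spark History Server": 18080,
    "HDFS NameNode": 9870,
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each required service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in REQUIRED_SERVICES.items()}
```

Calling `check_services()` on the master node returns a dict such as `{"YARN ResourceManager": True, ...}`. An open port only shows that something is listening; it does not verify that the service itself is healthy.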
Installation
# Clone the repository
git clone https://github.com/your-org/emr-mcp-server.git
cd emr-mcp-server
# Install dependencies
pip install -r requirements.txt
# Configure the server
cp config/server_config.yaml.example config/server_config.yaml
# Edit the configuration file with your EMR cluster details
Configuration
Edit config/server_config.yaml:
server:
  host: "0.0.0.0"
  port: 3000
  debug: false
  workers: 4
emr:
  region: "us-east-1"
  cluster_id: "j-XXXXXXXXX"  # Optional: specific cluster ID
yarn:
  resource_manager_url: "http://localhost:8088"
  timeout: 30
spark:
  history_server_url: "http://localhost:18080"
  timeout: 30
hdfs:
  namenode_url: "http://localhost:9870"
  timeout: 30
auth:
  method: "api_key"  # Options: api_key, jwt, iam
  api_keys:
    - "emr-mcp-default-key"
  jwt_secret: "your-jwt-secret"
logging:
  level: "INFO"
  format: "console"  # Options: console, json
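A quick sanity check of the edited file can catch typos before the server starts. The sketch below assumes the YAML has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`); `validate_config` is a hypothetical helper, not the project's own validation, and it checks only the options named in the sample above.

```python
from urllib.parse import urlparse

ALLOWED_AUTH_METHODS = {"api_key", "jwt", "iam"}  # from the auth.method comment
ALLOWED_LOG_FORMATS = {"console", "json"}         # from the logging.format comment
URL_FIELDS = [
    ("yarn", "resource_manager_url"),
    ("spark", "history_server_url"),
    ("hdfs", "namenode_url"),
]


def _section(cfg: dict, name: str) -> dict:
    """Return a config section, treating a missing or empty section as {}."""
    return cfg.get(name) or {}


def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    port = _section(cfg, "server").get("port")
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"server.port must be an integer in 1..65535, got {port!r}")

    auth = _section(cfg, "auth")
    method = auth.get("method")
    if method not in ALLOWED_AUTH_METHODS:
        problems.append(f"auth.method must be one of {sorted(ALLOWED_AUTH_METHODS)}, got {method!r}")
    elif method == "api_key" and not auth.get("api_keys"):
        problems.append("auth.api_keys must list at least one key when auth.method is api_key")

    for section, field in URL_FIELDS:
        parsed = urlparse(str(_section(cfg, section).get(field) or ""))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{section}.{field} must be an http(s) URL")

    log_format = _section(cfg, "logging").get("format")
    if log_format not in ALLOWED_LOG_FORMATS:
        problems.append(f"logging.format must be one of {sorted(ALLOWED_LOG_FORMATS)}, got {log_format!r}")

    return problems
```

Returning a list of problems instead of raising on the first one lets all mistakes be reported in a single pass.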
Running the Server
# Start the server directly
python -m src.server
# Or use the startup script
./scripts/start_server.sh
# Check server status
curl http://localhost:3000/health
MCP Tools
Cluster Management Tools
get_cluster_info
Retrieve comprehensive EMR cluster information including configuration, instance groups, and cost analysis.
{
  "name": "get_cluster_info",
  "arguments": {
    "cluster_id": "j-XXXXXXXXX"  // Optional
  }
}
list_clusters
List all EMR clusters with optional state filtering.
{
  "name": "list_clusters",
  "arguments": {
    "states": ["RUNNING", "WAITING"]  // Optional
  }
}
estimate_cost
Calculate current and projected costs with detailed breakdown.
{
  "name": "estimate_cost",
  "arguments": {
    "runtime_hours": 48.0,       // Optional
    "cluster_id": "j-XXXXXXXXX"  // Optional
  }
}
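The arithmetic behind a per-instance-type breakdown can be sketched as instance count × hourly rate × runtime, summed over instance groups. The example below is an illustration only: `InstanceGroup` and `estimate_cluster_cost` are hypothetical names, the hourly rates are made-up placeholders rather than AWS prices, and the tool's real pricing logic is not shown in this README.

```python
from dataclasses import dataclass


@dataclass
class InstanceGroup:
    instance_type: str
    count: int
    hourly_rate_usd: float  # placeholder value, NOT a real AWS price


def estimate_cluster_cost(groups, runtime_hours: float) -> dict:
    """Sum count * hourly rate * hours per instance type and overall."""
    by_type = {}
    for group in groups:
        cost = group.count * group.hourly_rate_usd * runtime_hours
        by_type[group.instance_type] = by_type.get(group.instance_type, 0.0) + cost
    return {"total_usd": sum(by_type.values()), "by_instance_type": by_type}


# Placeholder rates chosen only to make the arithmetic easy to follow.
groups = [
    InstanceGroup("m5.xlarge", count=1, hourly_rate_usd=0.25),
    InstanceGroup("m5.2xlarge", count=3, hourly_rate_usd=0.5),
    InstanceGroup("m5.large", count=2, hourly_rate_usd=0.125),
]
result = estimate_cluster_cost(groups, runtime_hours=48.0)
# 1*0.25*48 + 3*0.5*48 + 2*0.125*48 = 12 + 72 + 12 = 96
print(result["total_usd"])
```

A real estimate would also need to account for the EMR service surcharge, storage, and spot versus on-demand pricing, none of which are modeled here.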
suggest_instance_types
Get AI-powered instance type recommendations based on workload characteristics.
{
  "name": "suggest_instance_types",
  "arguments": {
    "workload_type": "memory_intensive",  // Options: general, compute_intensive, memory_intensive, storage_intensive
    "data_size_gb": 1000,                 // Optional
    "concurrent_jobs": 10                 // Optional
  }
}
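To give a feel for what the four `workload_type` options imply, the sketch below maps each one to EC2 instance families of the matching AWS category (general purpose, compute optimized, memory optimized, storage optimized). This mapping is an assumption for illustration; the README does not describe how the tool actually ranks instance types, and `candidate_families` is a hypothetical name.

```python
# Workload options from the tool description, mapped to EC2 families
# in the corresponding AWS category. Illustrative, not the tool's logic.
WORKLOAD_TO_FAMILIES = {
    "general": ["m5"],
    "compute_intensive": ["c5"],
    "memory_intensive": ["r5"],
    "storage_intensive": ["i3", "d2"],
}


def candidate_families(workload_type: str) -> list:
    """Return candidate instance families, or raise ValueError for an unknown type."""
    try:
        return list(WORKLOAD_TO_FAMILIES[workload_type])
    except KeyError:
        valid = ", ".join(sorted(WORKLOAD_TO_FAMILIES))
        raise ValueError(f"unknown workload_type {workload_type!r}; expected one of: {valid}") from None
```

Rejecting unknown values up front gives a clearer error than letting a typo such as `memory-intensive` reach the server.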
Monitoring Tools
monitor_resources
Get real-time resource utilization across YARN, HDFS, and cluster nodes.
{
  "name": "monitor_resources",
  "arguments": {}
}
analyze_yarn_applications
Analyze YARN applications with performance metrics and resource usage.
{
  "name": "analyze_yarn_applications",
  "arguments": {
    "states": ["RUNNING", "FINISHED"],  // Optional
    "application_types": ["SPARK"],     // Optional
    "limit": 50                         // Optional, default: 50
  }
}
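These arguments line up naturally with the query parameters of the YARN ResourceManager REST API (`/ws/v1/cluster/apps` with `states`, `applicationTypes`, and `limit`). Whether the tool calls that endpoint in exactly this way is an assumption; the sketch below only shows how such a URL could be built, and `yarn_apps_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def yarn_apps_url(base_url: str, states=None, application_types=None, limit: int = 50) -> str:
    """Build a ResourceManager 'cluster applications' URL from tool-style arguments."""
    params = {}
    if states:
        params["states"] = ",".join(states)
    if application_types:
        params["applicationTypes"] = ",".join(application_types)
    params["limit"] = str(limit)
    # safe="," keeps the comma-separated lists readable instead of %2C-encoded.
    return f"{base_url.rstrip('/')}/ws/v1/cluster/apps?{urlencode(params, safe=',')}"


print(yarn_apps_url("http://localhost:8088", ["RUNNING", "FINISHED"], ["SPARK"]))
```

Opening the resulting URL in a browser on the master node is a quick way to compare the raw ResourceManager data with what the tool reports.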
diagnose_performance
Identify performance bottlenecks and get optimization recommendations.
{
  "name": "diagnose_performance",
  "arguments": {
    "app_id": "application_1234567890_0001",  // Optional
    "time_range_hours": 24                    // Optional, default: 24
  }
}
Analytics Tools
get_spark_logs
Fetch and analyze Spark application logs for debugging and optimization.
{
  "name": "get_spark_logs",
  "arguments": {
    "app_id": "application_1234567890_0001",  // Required
    "executor_id": "1"                        // Optional
  }
}
recommend_configuration
Get workload-specific configuration recommendations for Spark and YARN.
{
  "name": "recommend_configuration",
  "arguments": {
    "workload_type": "batch",                 // Options: batch, streaming, ml, interactive
    "app_id": "application_1234567890_0001"   // Optional
  }
}
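The `//` annotations in the examples above are for readers only; strict JSON does not allow comments, so they have to be removed before a payload is sent. A small helper can build valid payloads and enforce the one argument this README marks as required (`app_id` for `get_spark_logs`). `build_tool_call` is a hypothetical name, and the table below reflects only what is documented here.

```python
import json

# Tool name -> arguments marked "Required" in this README.
# Tools with an empty set have no argument explicitly marked as required.
REQUIRED_ARGS = {
    "get_cluster_info": set(),
    "list_clusters": set(),
    "estimate_cost": set(),
    "suggest_instance_types": set(),
    "monitor_resources": set(),
    "analyze_yarn_applications": set(),
    "diagnose_performance": set(),
    "get_spark_logs": {"app_id"},
    "recommend_configuration": set(),
}


def build_tool_call(name: str, **arguments) -> str:
    """Return a JSON string for a tool call, omitting arguments set to None."""
    if name not in REQUIRED_ARGS:
        raise ValueError(f"unknown tool: {name!r}")
    args = {key: value for key, value in arguments.items() if value is not None}
    missing = REQUIRED_ARGS[name] - set(args)
    if missing:
        raise ValueError(f"{name} is missing required argument(s): {sorted(missing)}")
    return json.dumps({"name": name, "arguments": args})


print(build_tool_call("get_spark_logs", app_id="application_1234567890_0001", executor_id=None))
```

Dropping `None` values means optional arguments can be passed through from caller code without special-casing each one.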
Deployment Options
1. EMR Bootstrap Script (Recommended)
Deploy automatically when creating an EMR cluster:
# Upload bootstrap script to S3
aws s3 cp scripts/bootstrap-emr-mcp.sh s3://your-bucket/
# Create EMR cluster with MCP server
aws emr create-cluster \
  --name "EMR-MCP-Cluster" \
  --release-label emr-6.4.0 \
  --applications Name=Spark Name=Hadoop Name=Hive Name=Zeppelin \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m5.2xlarge,InstanceCount=3 \
    InstanceGroupType=TASK,InstanceType=m5.large,InstanceCount=2,BidPrice=0.05 \
  --bootstrap-actions Path=s3://your-bucket/bootstrap-emr-mcp.sh \
  --ec2-attributes KeyName=your-key-pair \
  --log-uri s3://your-bucket/emr-logs/
2. Docker Deployment
# Build the image
docker build -t emr-mcp-server .
# Run with docker-compose
docker-compose up -d
# Check logs
docker-compose logs -f emr-mcp-server
3. Systemd Service
# Copy service file
sudo cp scripts/emr-mcp-server.service /etc/systemd/system/
# Enable and start
sudo systemctl enable emr-mcp-server
sudo systemctl start emr-mcp-server
sudo systemctl status emr-mcp-server
Usage Examples
Python Client
import asyncio
from examples.client_example import EMRMCPClient

async def main():
    async with EMRMCPClient("http://localhost:3000", "emr-mcp-default-key") as client:
        # Get cluster information
        cluster_info = await client.call_tool("get_cluster_info")
        print("Cluster Info:", cluster_info["content"][0]["text"])

        # Monitor resources
        resources = await client.call_tool("monitor_resources")
        print("Resources:", resources["content"][0]["text"])

        # Get configuration recommendations
        config_rec = await client.call_tool("recommend_configuration", {
            "workload_type": "batch"
        })
        print("Config Recommendations:", config_rec["content"][0]["text"])

asyncio.run(main())
cURL Examples
# Health check
curl http://localhost:3000/health
# List available tools
curl -X GET http://localhost:3000/tools \
  -H "X-API-Key: emr-mcp-default-key"
# Get cluster information
curl -X POST http://localhost:3000/tools/call \
  -H "Content-Type: application/json" \
  -H "X-API-Key: emr-mcp-default-key" \
  -d '{
    "name": "get_cluster_info",
    "arguments": {}
  }'
# Monitor resources
curl -X POST http://localhost:3000/tools/call \
  -H "Content-Type: application/json" \
  -H "X-API-Key: emr-mcp-default-key" \
  -d '{
    "name": "monitor_resources",
    "arguments": {}
  }'
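The same POST requests can be made from Python without third-party packages. This sketch mirrors the cURL calls above using `urllib.request`; the helper names `make_tool_request` and `call_tool` are illustrative, and it assumes the endpoint answers with a JSON body, as the Python client example suggests.

```python
import json
import urllib.request


def make_tool_request(base_url: str, api_key: str, name: str, arguments=None) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the cURL examples."""
    body = json.dumps({"name": name, "arguments": arguments or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tools/call",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def call_tool(base_url: str, api_key: str, name: str, arguments=None, timeout: float = 30.0):
    """Send the request and decode the response, assuming it is JSON."""
    request = make_tool_request(base_url, api_key, name, arguments)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call_tool("http://localhost:3000", "emr-mcp-default-key", "monitor_resources")` corresponds to the last cURL command. Splitting request construction from sending makes the payload easy to inspect or unit test without a live server.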
Development
Running Tests
# Install development dependencies
pip install -r requirements.txt
# Run all tests
pytest
# Run specific test file
pytest tests/test_cluster.py -v
# Run with coverage
pytest --cov=src tests/ --cov-report=html
# Run demo with mock data
python demo.py
# Test server creation
python test_server.py
Code Quality
# Format code
black src/ tests/ examples/
# Sort imports
isort src/ tests/ examples/
# Type checking
mypy src/
# Linting
flake8 src/ tests/ examples/
Architecture
emr-mcp-server/
├── src/
│   ├── server.py              # Main MCP server implementation
│   ├── tools/                 # MCP tool implementations
│   │   ├── cluster.py         # Cluster management tools
│   │   ├── monitoring.py      # Resource monitoring tools
│   │   └── analytics.py       # Analytics and optimization tools
│   ├── connectors/            # Service connectors
│   │   ├── emr.py             # EMR API connector
│   │   ├── yarn.py            # YARN ResourceManager connector
│   │   ├── spark.py           # Spark History Server connector
│   │   └── hdfs.py            # HDFS NameNode connector
│   └── utils/                 # Utilities
│       ├── config.py          # Configuration management
│       └── auth.py            # Authentication utilities
├── config/
│   └── server_config.yaml     # Server configuration
├── tests/                     # Comprehensive test suite
├── examples/                  # Usage examples
├── scripts/                   # Deployment scripts
├── Dockerfile                 # Docker configuration
├── docker-compose.yml         # Docker Compose setup
├── demo.py                    # Demo with mock data
└── test_server.py             # Server creation test
Key Features Demonstrated
Completed Implementation
- Complete Project Structure
  - Organized codebase with clear separation of concerns
  - Proper Python package structure with imports
  - Configuration management with YAML and environment variables
- MCP Server Implementation
  - Full MCP protocol compliance with tool registration
  - Async/await architecture for high performance
  - Structured logging with configurable formats
  - Graceful shutdown with proper cleanup
- Service Connectors
  - EMR API integration for cluster management
  - YARN ResourceManager connector for application monitoring
  - Spark History Server connector for job analysis
  - HDFS NameNode connector for storage monitoring
  - Connection pooling and retry logic
- MCP Tools
  - Cluster Management: get_cluster_info, list_clusters, estimate_cost, suggest_instance_types
  - Monitoring: monitor_resources, analyze_yarn_applications, diagnose_performance
  - Analytics: get_spark_logs, recommend_configuration
  - All tools return structured markdown with actionable insights
- Security & Authentication
  - Multi-method authentication (API keys, JWT, IAM roles)
  - Input validation and sanitization
  - Secure configuration management
- Deployment Ready
  - Docker containerization with multi-stage builds
  - EMR bootstrap script for automatic deployment
  - Systemd service configuration
  - Docker Compose for development
- Testing & Quality
  - Comprehensive test suite with mocking
  - Demo script with realistic mock data
  - Code quality tools (black, isort, mypy, flake8)
  - Type hints throughout codebase
- Documentation & Examples
  - Detailed README with usage examples
  - Python client example with async patterns
  - cURL examples for API testing
  - Configuration examples and deployment guides
Demo Results
The demo successfully shows:
EMR MCP Server Demo
================================================================================
EMR Cluster Management Demo
Getting Cluster Information...
Cost Estimation...
Instance Type Suggestions...
Resource Monitoring Demo
Resource Monitoring...
YARN Applications Analysis...
Analytics & Configuration Demo
Configuration Recommendations for Batch Workload...
Configuration Recommendations for ML Workload...
Demo completed successfully!
Production Ready Features
- Error Handling: Comprehensive error handling with meaningful messages
- Logging: Structured logging with multiple output formats
- Configuration: Environment-based configuration with validation
- Monitoring: Health checks and metrics endpoints
- Security: Authentication, authorization, and input validation
- Performance: Async operations, connection pooling, caching
- Deployment: Multiple deployment options with automation
Contributing
We welcome contributions! Please see our development workflow:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run the test suite and quality checks
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- AWS EMR Team for the excellent big data platform
- MCP Community for the protocol specification
- Apache Spark and Hadoop communities
Made with ❤️ for the EMR community
Ready for production deployment on EMR clusters!