mcp-scp by repolhomp3 - MCP Server

SRE MCP Server

An MCP (Model Context Protocol) server that automates SRE tasks through runbooks and interactive playbooks, with AWS Bedrock integration for log analysis and decision support.

Features

Automated Runbooks: Execute predefined SRE procedures (pod restarts, health checks, disk cleanup)
Interactive Playbooks: Step-by-step guided workflows for incident response and maintenance
AWS Bedrock Integration: Intelligent log analysis and decision support
Kubernetes Native: Deep integration with EKS clusters
Security First: RBAC, non-root containers, read-only filesystems

Quick Start

Prerequisites

EKS cluster with kubectl access
Docker for building images
AWS credentials configured for Bedrock access

Deploy to EKS

# Set your container registry (optional)
export REGISTRY=your-registry.com

# Deploy to cluster
./deploy.sh

Local Development

# Install dependencies
pip install -r requirements.txt

# Run locally (requires kubeconfig)
python src/sre_mcp_server.py

MCP Tools Available

Core Operations

execute_runbook: Run predefined automation procedures
start_interactive_playbook: Begin guided incident response workflows
get_cluster_health: Comprehensive cluster status and metrics
analyze_logs: AI-powered log analysis using AWS Bedrock
scale_deployment: Safe deployment scaling with confirmations
create_incident_response: Structured incident management
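
As an illustration of what "safe scaling with confirmations" might mean for scale_deployment, here is a minimal sketch of a guard that decides whether a replica change needs explicit confirmation. The function name `plan_scale`, the `confirmed` flag, and the `max_step` threshold are assumptions made for this example, not the server's actual interface or policy.

```python
def plan_scale(current: int, desired: int, confirmed: bool = False,
               max_step: int = 5) -> dict:
    """Decide whether a replica change may proceed (hypothetical policy).

    Scaling to zero, or changing the replica count by more than max_step,
    is treated as risky and requires confirmed=True.
    """
    if desired < 0:
        raise ValueError("replica count cannot be negative")
    risky = desired == 0 or abs(desired - current) > max_step
    if risky and not confirmed:
        return {"apply": False, "reason": "confirmation required",
                "current": current, "desired": desired}
    return {"apply": True, "reason": "ok",
            "current": current, "desired": desired}
```

A caller would apply the change only when `apply` is true, and otherwise ask the operator to confirm and retry with `confirmed=True`.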

Example Usage

# Illustrative tool calls; run them from an async context (e.g. an asyncio REPL)

# Execute a runbook
await execute_runbook(
    runbook_name="pod_restart",
    parameters={"namespace": "production", "selector": "app=web"},
    dry_run=True
)

# Start incident response
await start_interactive_playbook(
    playbook_type="incident_response",
    incident_id="INC-001",
    severity="high"
)

# Analyze logs with AI
await analyze_logs(
    service_name="api-service",
    time_range="1h",
    log_level="error"
)

Available Runbooks

pod_restart: Graceful pod restart with health verification
service_health_check: Comprehensive service and endpoint validation
disk_cleanup: Automated cleanup of temporary files and logs
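
The runbooks listed above could be organised as a small registry keyed by name. The following is a minimal sketch under stated assumptions, not the project's actual code: `Step`, `Runbook`, `RUNBOOKS`, `run`, and `CALLS` are hypothetical names, and the step actions only record that they ran instead of calling Kubernetes. It shows how a `dry_run` flag can report the planned steps without executing them.

```python
from dataclasses import dataclass, field
from typing import Callable

# Records which actions ran; stands in for real Kubernetes API calls.
CALLS: list[str] = []


@dataclass
class Step:
    description: str                 # may use {placeholders} from parameters
    action: Callable[[dict], None]   # side effect, skipped in dry-run mode


@dataclass
class Runbook:
    name: str
    steps: list[Step] = field(default_factory=list)


# Hypothetical registry; only pod_restart is filled in for illustration.
RUNBOOKS = {
    "pod_restart": Runbook(
        name="pod_restart",
        steps=[
            Step("restart pods matching {selector} in {namespace}",
                 lambda p: CALLS.append("restart")),
            Step("verify pods in {namespace} are Ready",
                 lambda p: CALLS.append("verify")),
        ],
    ),
}


def run(runbook_name: str, parameters: dict, dry_run: bool = True) -> dict:
    """Describe every step; execute the actions only when dry_run is False."""
    runbook = RUNBOOKS.get(runbook_name)
    if runbook is None:
        raise KeyError(f"unknown runbook: {runbook_name}")
    described = []
    for step in runbook.steps:
        if not dry_run:
            step.action(parameters)
        described.append(step.description.format(**parameters))
    return {
        "runbook": runbook.name,
        "status": "dry_run" if dry_run else "executed",
        "steps": described,
    }
```

Defaulting `dry_run` to `True` in this sketch means a caller has to opt in to side effects explicitly.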

Interactive Playbooks

incident_response: Structured incident management workflow
deployment_rollback: Safe rollback procedures with verification
maintenance: Guided maintenance operations

Configuration

Environment Variables

AWS_REGION: AWS region for Bedrock (default: us-east-1)
LOG_LEVEL: Logging level (default: INFO)
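
A minimal sketch of how these two variables and their documented defaults might be read at startup. The helper name `load_settings` and the returned dictionary keys are assumptions for this example; falling back to INFO for an unrecognised level name is likewise an illustrative choice.

```python
import logging
import os


def load_settings(env=None) -> dict:
    """Read AWS_REGION and LOG_LEVEL, falling back to the documented defaults."""
    env = os.environ if env is None else env
    region = env.get("AWS_REGION", "us-east-1")
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    # Map the level name to a logging constant; unknown names fall back to INFO.
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO
    return {"aws_region": region, "log_level": level}
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real environment.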

Kubernetes RBAC

The server requires cluster-level permissions for:

Reading pods, services, deployments
Updating/patching deployments for scaling
Accessing metrics APIs
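
The permissions above can be expressed as a ClusterRole. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The role name `sre-mcp-server`, the exact verb lists, and the inclusion of the `deployments/scale` subresource are assumptions; the manifests shipped with the project may differ.

```python
import json


def build_cluster_role(name: str = "sre-mcp-server") -> dict:
    """Return a ClusterRole manifest covering the permissions listed above."""
    read = ["get", "list", "watch"]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},  # hypothetical role name
        "rules": [
            # Read pods and services (core API group is the empty string).
            {"apiGroups": [""], "resources": ["pods", "services"], "verbs": read},
            # Read deployments and update/patch them for scaling.
            {"apiGroups": ["apps"], "resources": ["deployments"],
             "verbs": read + ["update", "patch"]},
            # Assumption: scaling may go through the scale subresource.
            {"apiGroups": ["apps"], "resources": ["deployments/scale"],
             "verbs": ["get", "update", "patch"]},
            # Metrics API served by metrics-server.
            {"apiGroups": ["metrics.k8s.io"], "resources": ["pods", "nodes"],
             "verbs": ["get", "list"]},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_cluster_role(), indent=2))
```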

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │────│  SRE MCP Server  │────│  AWS Bedrock    │
│  (Q Developer)  │    │                  │    │   (Claude-3)    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                │
                       ┌──────────────────┐
                       │   EKS Cluster    │
                       │  (Kubernetes)    │
                       └──────────────────┘

Security Considerations

Non-root container execution
Read-only root filesystem
Minimal RBAC permissions
Secure secrets management for AWS credentials
Network policies for pod-to-pod communication
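
For reference, the first two items map onto standard Kubernetes container `securityContext` fields. The sketch below builds such a fragment as a Python dictionary; the function name and the user ID `1000` are assumptions, and dropping all capabilities and disabling privilege escalation are common companions to these settings, not something this README states.

```python
def container_security_context(run_as_user: int = 1000) -> dict:
    """Return a securityContext fragment for a container spec (illustrative)."""
    return {
        "runAsNonRoot": True,              # non-root container execution
        "runAsUser": run_as_user,          # hypothetical non-zero UID
        "readOnlyRootFilesystem": True,    # read-only root filesystem
        "allowPrivilegeEscalation": False, # assumption: common hardening default
        "capabilities": {"drop": ["ALL"]}, # assumption: common hardening default
    }
```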

Monitoring & Observability

Structured JSON logging with correlation IDs
Health and readiness probes
Prometheus metrics endpoint (planned)
Distributed tracing support (planned)
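
A minimal sketch of structured JSON logging with a correlation ID, using only the standard library. The names `correlation_id`, `JsonFormatter`, `configure_logging`, and the logger name `sre_mcp` are assumptions for this example; the server's actual log fields may differ.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the current request or task (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a JSON-formatting stream handler to a dedicated logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("sre_mcp")
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = configure_logging()
    correlation_id.set(uuid.uuid4().hex)  # one ID per incoming request
    log.info("runbook started")
```

A context variable is used so that concurrent asyncio tasks each see their own correlation ID.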

Development

Adding New Runbooks

Add runbook definition to src/runbooks.py
Register in SREMCPServer.setup_tools()
Update documentation

Adding New Playbooks

Create playbook template in src/playbooks.py
Define step-by-step workflow
Add automation hooks where appropriate
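
The three steps above can be sketched as a template made of ordered steps, each with an optional automation hook. This is an illustrative shape only: `PlaybookStep`, `PlaybookSession`, and the `deployment_rollback` builder are hypothetical names, and the real structure in src/playbooks.py is not shown in this README.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PlaybookStep:
    prompt: str                                   # text shown to the operator
    hook: Optional[Callable[[dict], str]] = None  # optional automation hook


@dataclass
class PlaybookSession:
    playbook_type: str
    steps: list[PlaybookStep]
    context: dict = field(default_factory=dict)
    position: int = 0

    def current(self) -> Optional[PlaybookStep]:
        """Return the step awaiting action, or None when the playbook is done."""
        if self.position < len(self.steps):
            return self.steps[self.position]
        return None

    def advance(self) -> Optional[str]:
        """Run the current step's hook (if any), then move to the next step."""
        step = self.current()
        if step is None:
            raise RuntimeError("playbook already finished")
        result = step.hook(self.context) if step.hook else None
        self.position += 1
        return result


def deployment_rollback(deployment: str) -> PlaybookSession:
    """Build a hypothetical three-step rollback playbook."""
    return PlaybookSession(
        playbook_type="deployment_rollback",
        steps=[
            PlaybookStep("Confirm which deployment to roll back"),
            PlaybookStep("Record the rollback target",
                         hook=lambda ctx: f"target: {ctx['deployment']}"),
            PlaybookStep("Verify the service is healthy"),
        ],
        context={"deployment": deployment},
    )
```

Steps without a hook are purely manual, so the operator stays in control of when the workflow moves forward.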

Contributing

Follow existing code patterns
Add comprehensive error handling
Include structured logging
Update documentation
Test with dry-run capabilities