mcp-monitoring by reemshai10 - MCP Server

📊 Monitoring MCP Server

A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.

🌟 Overview

This MCP server transforms how you interact with monitoring infrastructure by providing:

Natural Language Processing: Ask monitoring questions in plain English
Intelligent Query Translation: Automatically converts questions to PromQL queries
Historical Alert Analysis: Count failures, outages, and incidents over time
Multi-Source Integration: Seamlessly works with Prometheus, AlertManager, and Grafana
Automated Incident Detection: Smart pattern recognition for service failures

✨ Key Features

🧠 Natural Language Query Engine

Smart Intent Recognition: Understands monitoring questions like "How many times did service X fail?"
Automatic Time Range Parsing: Handles phrases like "last 2 weeks", "yesterday", "past month"
Service Name Detection: Recognizes services like opengrok, jenkins, grafana, prometheus
Alert Pattern Matching: Identifies automation failures, service outages, and critical incidents
Context-Aware Responses: Provides detailed breakdowns with incident counts and durations
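
The time range parsing described above could look roughly like the sketch below. It is illustrative only: `parseTimeRange`, the `TimeRange` shape, and the 30-day approximation of a month are assumptions, not the server's actual implementation.

```typescript
// Hypothetical helper: turns phrases such as "last 2 weeks" into a start/end pair.
interface TimeRange {
  start: Date;
  end: Date;
}

const UNIT_MS = {
  hour: 3600000,
  day: 86400000,
  week: 7 * 86400000,
  month: 30 * 86400000, // assumption: a "month" is approximated as 30 days
} as const;

type Unit = keyof typeof UNIT_MS;

function parseTimeRange(phrase: string, now: Date = new Date()): TimeRange | null {
  const text = phrase.toLowerCase();

  if (/\byesterday\b/.test(text)) {
    // Midnight-to-midnight of the previous UTC day.
    const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
    return { start: new Date(end.getTime() - UNIT_MS.day), end };
  }

  // Matches "last 2 weeks", "past month", "past 7 days", and similar phrases.
  const match = text.match(/\b(?:last|past)\s+(?:(\d+)\s+)?(hour|day|week|month)s?\b/);
  if (!match) return null;

  const count = match[1] ? parseInt(match[1], 10) : 1;
  const span = count * UNIT_MS[match[2] as Unit];
  return { start: new Date(now.getTime() - span), end: now };
}
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to test against fixed timestamps.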

🔍 Prometheus Integration

Advanced PromQL Generation: Automatically creates complex queries based on natural language
Historical Data Analysis: Analyzes alert trends and service availability over time
Metric Discovery: Browse and search available metrics with intelligent filtering
Range Query Optimization: Smart step sizing for different time ranges
Alert History Tracking: Tracks firing periods and incident detection
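
One plausible way to implement the step sizing mentioned above is to pick the finest step that keeps the number of samples per series under a cap. The ladder of steps, the `chooseStepSeconds` name, and the default cap of 1000 points are assumptions made for this sketch; the server may use different values.

```typescript
// Hypothetical step ladder, from 15 seconds up to one day.
const STEP_LADDER_SECONDS = [15, 60, 300, 900, 3600, 21600, 86400];

// Returns the smallest step (in seconds) that yields at most maxPoints samples.
function chooseStepSeconds(start: Date, end: Date, maxPoints: number = 1000): number {
  const rangeSeconds = Math.max(0, (end.getTime() - start.getTime()) / 1000);
  for (const step of STEP_LADDER_SECONDS) {
    if (rangeSeconds / step <= maxPoints) return step;
  }
  // Range too long for every ladder entry: fall back to the coarsest step.
  return STEP_LADDER_SECONDS[STEP_LADDER_SECONDS.length - 1];
}

// A 14-day range resolves to a 1-hour step (336 samples per series).
const exampleStep = chooseStepSeconds(
  new Date("2024-01-01T00:00:00Z"),
  new Date("2024-01-15T00:00:00Z"),
);
```

Capping the sample count keeps long-range queries cheap and well below the per-series point limit that Prometheus enforces on range queries.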

🚨 AlertManager Integration

Real-time Alert Monitoring: Query active, pending, and resolved alerts
Smart Alert Filtering: Filter by service, severity, alertname, or custom labels
Alert Fingerprinting: Track unique alert instances and their lifecycle
Incident Correlation: Group related alerts and calculate total impact
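
As an illustration of label-based filtering, the sketch below parses a filter string and keeps only the alerts whose labels match. The comma-separated `key=value` syntax, the `AlertLike` shape, and both function names are assumptions; the README only shows a single `alertname=...` filter.

```typescript
// Minimal alert shape: only the labels matter for filtering in this sketch.
interface AlertLike {
  labels: Record<string, string>;
}

// Parses "key=value,key2=value2" into label pairs (assumed filter syntax).
function parseLabelFilter(filter: string): Array<[string, string]> {
  return filter
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.includes("="))
    .map((part): [string, string] => {
      const idx = part.indexOf("=");
      return [part.slice(0, idx).trim(), part.slice(idx + 1).trim()];
    });
}

// Keeps alerts whose labels equal every key/value pair in the filter.
function filterAlerts<T extends AlertLike>(alerts: T[], filter: string): T[] {
  const pairs = parseLabelFilter(filter);
  return alerts.filter((alert) => pairs.every(([key, value]) => alert.labels[key] === value));
}
```

An empty or malformed filter yields no pairs, so every alert passes through unchanged.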

📊 Grafana Integration (Optional)

Dashboard Discovery: Find dashboards related to specific services
Dynamic Dashboard Links: Generate direct links to relevant monitoring views
Service Context Mapping: Connect services to their monitoring dashboards

🛠️ Available Tools

Natural Language Query

// Ask monitoring questions in plain English
mcp_monitoring_natural_language_query({
  question: "how many times did jenkins fail in the last week?",
  timeRange: "last week"  // optional
})

Active Alerts

// Get currently firing alerts
mcp_monitoring_get_active_alerts({
  filter: "alertname=cleanup-zuultmp"  // optional filter
})

Prometheus Instant Query

// Execute PromQL queries
mcp_monitoring_query_prometheus({
  query: "up{job='prometheus'}",
  time: "2024-01-15T10:30:00Z"  // optional timestamp
})

Prometheus Range Query

// Get historical time series data
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{severity='critical'}",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "1h"  // optional resolution
})
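
For context, a range query like the one above corresponds to the `/api/v1/query_range` endpoint of the Prometheus HTTP API. The sketch below shows how such a request URL could be assembled; `buildRangeQueryUrl`, the `RangeQueryArgs` type, and the `1h` default step are assumptions for illustration, not the tool's actual code.

```typescript
interface RangeQueryArgs {
  query: string; // PromQL expression
  start: string; // RFC 3339 timestamp
  end: string; // RFC 3339 timestamp
  step?: string; // resolution, e.g. "1h"
}

// Builds a Prometheus range-query URL from a base URL and the tool arguments.
function buildRangeQueryUrl(baseUrl: string, args: RangeQueryArgs): string {
  const params: Array<[string, string]> = [
    ["query", args.query],
    ["start", args.start],
    ["end", args.end],
    ["step", args.step ?? "1h"], // assumed default when no step is given
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  // Strip trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, "")}/api/v1/query_range?${queryString}`;
}
```

Encoding each value separately matters here because PromQL selectors contain characters such as `{`, `}`, and `=` that are not safe in a query string.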

🚀 Quick Start

Installation

git clone <repository-url>
cd monitoring-mcp
npm install
npm run build

Configuration

Set environment variables:

export PROMETHEUS_URL="https://prometheus.example.com"
export ALERTMANAGER_URL="https://alertmanager.example.com"
export GRAFANA_URL="https://grafana.example.com"          # Optional
export GRAFANA_API_TOKEN="your-grafana-token"             # Optional - Ask admin to create service user and provide token
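
A server could validate these variables at startup along the following lines. This is a sketch under assumptions: `loadConfig` and `MonitoringConfig` are hypothetical names, and treating the Prometheus and AlertManager URLs as required is inferred from the fact that only the Grafana variables are marked optional.

```typescript
interface MonitoringConfig {
  prometheusUrl: string;
  alertmanagerUrl: string;
  grafanaUrl?: string;
  grafanaApiToken?: string;
}

type Env = Record<string, string | undefined>;

// Reads the variables listed above; throws if a required one is missing or blank.
function loadConfig(env: Env): MonitoringConfig {
  const required = (key: string): string => {
    const value = env[key]?.trim();
    if (!value) throw new Error(`Missing required environment variable: ${key}`);
    return value;
  };
  const optional = (key: string): string | undefined => env[key]?.trim() || undefined;

  return {
    prometheusUrl: required("PROMETHEUS_URL"),
    alertmanagerUrl: required("ALERTMANAGER_URL"),
    grafanaUrl: optional("GRAFANA_URL"),
    grafanaApiToken: optional("GRAFANA_API_TOKEN"),
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.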

Running the Server

npm start
# or
node dist/index.js

💬 Natural Language Examples

Service Failure Analysis

Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures, with a total downtime of 2 days and 3 hours

Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions

Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline
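
Failure counts and durations like those above could be derived from range-query samples by grouping consecutive samples into firing periods. The sketch below assumes the input is the list of sample timestamps (in seconds) at which an alert series was present; `groupFiringPeriods`, `totalDurationSec`, and the gap rule are illustrative assumptions rather than the server's actual algorithm.

```typescript
interface Incident {
  startSec: number;
  endSec: number;
}

// Groups sample timestamps into incidents: a gap larger than one step starts a new incident.
function groupFiringPeriods(timestamps: number[], stepSec: number): Incident[] {
  const sorted = [...timestamps].sort((a, b) => a - b);
  const incidents: Incident[] = [];
  for (const ts of sorted) {
    const current = incidents[incidents.length - 1];
    if (current && ts - current.endSec <= stepSec) {
      current.endSec = ts; // still the same firing period
    } else {
      incidents.push({ startSec: ts, endSec: ts });
    }
  }
  return incidents;
}

// Sums incident durations, counting one step per sample so a single-sample incident is not zero.
function totalDurationSec(incidents: Incident[], stepSec: number): number {
  return incidents.reduce((sum, i) => sum + (i.endSec - i.startSec) + stepSec, 0);
}

// Two incidents: samples at 0..120 and at 600..660, with a 60-second step.
const exampleIncidents = groupFiringPeriods([0, 60, 120, 600, 660], 60);
const exampleDowntime = totalDurationSec(exampleIncidents, 60); // 300 seconds
```

The accuracy of both numbers is bounded by the query step: a coarser step can merge short, separate incidents or miss brief ones entirely.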

Service Availability Queries

Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis

Q: "Show cleanup-zuultmp disk usage alerts"  
A: Disk space warnings and critical alerts breakdown

Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report

🔧 Integration Examples

VS Code MCP Configuration

{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_KEY": "${input:grafana_api_key}"
      }
    }
  }
}

For the Grafana token, ask your Grafana admin to create a service user and provide its token.

🎯 Use Cases

DevOps Teams

Incident Response: Quickly assess service health and failure patterns
Postmortem Analysis: Historical incident data for root cause analysis
Capacity Planning: Trend analysis and resource utilization monitoring
Alert Fatigue Management: Identify noisy alerts and optimization opportunities

SRE Teams

SLI/SLO Monitoring: Service availability and performance tracking
Error Budget Analysis: Calculate error rates and availability metrics
Automated Reporting: Generate incident reports and availability summaries
Proactive Monitoring: Identify patterns before they become critical issues

Development Teams

Deployment Monitoring: Track deployment success/failure rates
Performance Regression Detection: Compare metrics across releases
Integration Testing: Monitor test environment stability
Feature Flag Impact: Assess performance impact of feature rollouts

🧩 Architecture

Smart Query Processing Pipeline

Intent Recognition: Parse natural language to understand query type
Service Detection: Identify target services and components
Time Range Extraction: Parse temporal expressions into date ranges
PromQL Generation: Create optimized queries based on intent
Data Analysis: Process results and calculate meaningful metrics
Response Formatting: Present data in human-readable format

Supported Query Types

current_alerts: Active/firing alerts right now
historical_alerts: Past incidents and failure counts
service_availability: Uptime/downtime analysis
dashboard_discovery: Find relevant monitoring dashboards
metrics: General metric queries and analysis
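
A simple way to map a question onto one of these query types is an ordered list of keyword rules. The sketch below is an assumption about how intent recognition might work, not the server's actual logic; the `classifyIntent` name and the keyword lists are invented for illustration.

```typescript
type QueryType =
  | "current_alerts"
  | "historical_alerts"
  | "service_availability"
  | "dashboard_discovery"
  | "metrics";

// Ordered rules: the first matching pattern wins; "metrics" is the fallback.
const INTENT_RULES: Array<[RegExp, QueryType]> = [
  [/\bdashboards?\b/, "dashboard_discovery"],
  [/\b(down|downtime|uptime|available|availability)\b/, "service_availability"],
  [/\b(how many|count|fail(ed|ures?)?|outages?|incidents?|last|past|yesterday)\b/, "historical_alerts"],
  [/\b(alerts?|firing|active)\b/, "current_alerts"],
];

function classifyIntent(question: string): QueryType {
  const text = question.toLowerCase();
  for (const [pattern, type] of INTENT_RULES) {
    if (pattern.test(text)) return type;
  }
  return "metrics";
}
```

Rule order encodes priority: "How many times was prometheus down last week?" contains both historical and availability keywords, and the availability rule wins because it comes first.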

📈 Performance Features

Intelligent Query Optimization: Automatic step sizing for different time ranges
Result Caching: Avoid redundant API calls for recent queries
Timeout Handling: Graceful handling of slow monitoring APIs
Batch Processing: Efficient handling of multi-service queries
Memory Management: Optimized for long-running server deployment
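
Result caching for recent queries can be as small as a map with expiry times. The `TtlCache` class below is a hypothetical sketch, with the clock injected so that expiry is testable; the actual cache in the server may differ in eviction policy and key format.

```typescript
// Minimal time-to-live cache keyed by a string such as the query plus its time range.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Entries are only removed when read after expiry, so a long-running server with many distinct queries would also want a size bound or periodic sweep.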

🔒 Security & Best Practices

Authentication

Secure API token storage for Grafana integration
Support for basic auth with Prometheus/AlertManager
Environment variable configuration for sensitive data

Network Security

HTTPS-only connections to monitoring services
Configurable timeout and retry policies
Certificate validation for secure connections

Access Control

Read-only operations by design
No data modification capabilities
Audit logging for all monitoring queries

🐛 Troubleshooting

Common Issues

# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity

# Authentication failures  
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials

# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range

# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity

Debug Mode

# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js

# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"

🚀 Advanced Usage

Custom Service Detection

The server automatically recognizes these services:

cleanup-zuultmp, opengrok, jenkins
grafana, prometheus, alertmanager
gerrit, nginx, mysql, redis, elasticsearch
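
Detecting these names in a question could be done with a boundary-aware search, as in the sketch below. The service list is taken from the lines above; `detectServices` and the matching rule are assumptions for illustration.

```typescript
const KNOWN_SERVICES = [
  "cleanup-zuultmp", "opengrok", "jenkins",
  "grafana", "prometheus", "alertmanager",
  "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

// Returns every known service mentioned in the question, in list order.
function detectServices(question: string): string[] {
  const text = question.toLowerCase();
  return KNOWN_SERVICES.filter((service) => {
    // Escape regex metacharacters in the service name.
    const escaped = service.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Require a non-alphanumeric character (or string edge) on both sides,
    // so "redis" does not match inside a longer word such as "redistribute".
    return new RegExp(`(^|[^a-z0-9])${escaped}([^a-z0-9]|$)`).test(text);
  });
}
```

Because a hyphen counts as a boundary, a question about "prevent-opengrok" still resolves to the `opengrok` service.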

Advanced Natural Language Patterns

"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"

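The first pattern above can be read as a template with two slots. The sketch below shows one way to extract them with a regular expression; `matchFailureQuestion` and the exact pattern are hypothetical and only cover that single phrasing.

```typescript
interface FailureQuestion {
  service: string; // text found in the [service] slot
  period: string; // text found in the [time period] slot
}

// Matches: "How many times did [service] fail in the last [time period]?"
const FAILURE_PATTERN = /^how many times did (.+?) fail in the last (.+?)\??$/i;

function matchFailureQuestion(question: string): FailureQuestion | null {
  const m = question.trim().match(FAILURE_PATTERN);
  return m ? { service: m[1], period: m[2] } : null;
}
```

The extracted slots are raw text, so they would still need service detection and time range parsing before a query can be built.
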
🤝 Contributing

Contributions welcome! Please ensure:

TypeScript compilation passes (npm run build)
Natural language query tests pass
Documentation is updated for new features
Error handling is comprehensive

Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions