mcp-monitoring

reemshai10/mcp-monitoring

📊 Monitoring MCP Server

A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.

🌟 Overview

This MCP server transforms how you interact with monitoring infrastructure by providing:

  • Natural Language Processing: Ask monitoring questions in plain English
  • Intelligent Query Translation: Automatically converts questions to PromQL queries
  • Historical Alert Analysis: Count failures, outages, and incidents over time
  • Multi-Source Integration: Seamlessly works with Prometheus, AlertManager, and Grafana
  • Automated Incident Detection: Smart pattern recognition for service failures

✨ Key Features

🧠 Natural Language Query Engine

  • Smart Intent Recognition: Understands monitoring questions like "How many times did service X fail?"
  • Automatic Time Range Parsing: Handles phrases like "last 2 weeks", "yesterday", "past month"
  • Service Name Detection: Recognizes services like opengrok, jenkins, grafana, prometheus
  • Alert Pattern Matching: Identifies automation failures, service outages, and critical incidents
  • Context-Aware Responses: Provides detailed breakdowns with incident counts and durations
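
As a rough illustration of the time range parsing described above, the sketch below turns phrases such as "last 2 weeks", "yesterday", or "past month" into a start/end pair. The function name `parseTimeRange`, the 30-day month, and the UTC handling of "yesterday" are assumptions made for this example, not the server's actual implementation.

```typescript
// Hypothetical helper: not taken from the mcp-monitoring source.
interface TimeRange {
  start: Date;
  end: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

const UNIT_MS: Record<string, number> = {
  hour: HOUR_MS,
  day: DAY_MS,
  week: 7 * DAY_MS,
  month: 30 * DAY_MS, // assumption: a "month" is treated as 30 days
};

function parseTimeRange(phrase: string, now: Date = new Date()): TimeRange | null {
  const text = phrase.toLowerCase();

  // "yesterday": the previous calendar day, measured in UTC.
  if (/\byesterday\b/.test(text)) {
    const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
    return { start: new Date(end.getTime() - DAY_MS), end };
  }

  // "last 2 weeks", "past month", "past 7 days": optional count plus a unit.
  const match = /\b(?:last|past)\s+(?:(\d+)\s+)?(hour|day|week|month)s?\b/.exec(text);
  if (!match) return null;

  const count = match[1] ? parseInt(match[1], 10) : 1;
  const span = count * UNIT_MS[match[2]];
  return { start: new Date(now.getTime() - span), end: now };
}
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to cover phrases like "last week" in unit tests.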

🔍 Prometheus Integration

  • Advanced PromQL Generation: Automatically creates complex queries based on natural language
  • Historical Data Analysis: Analyzes alert trends and service availability over time
  • Metric Discovery: Browse and search available metrics with intelligent filtering
  • Range Query Optimization: Smart step sizing for different time ranges
  • Alert History Tracking: Tracks firing periods and incident detection
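
One plausible way to implement step sizing is to pick the smallest step that keeps a range query under a target number of points. Prometheus rejects range queries that would return more than 11,000 points per series, so a conservative target leaves plenty of headroom. The candidate steps, the default of 1000 points, and the name `chooseStep` are illustrative assumptions.

```typescript
// Candidate steps in seconds, smallest first: 15s, 1m, 5m, 15m, 1h, 6h, 1d.
// Assumption: these values are illustrative, not the server's real table.
const CANDIDATE_STEPS_SECONDS = [15, 60, 300, 900, 3600, 21600, 86400];

// Return the smallest candidate step (as a Prometheus duration string)
// that yields at most maxPoints samples per series for the given range.
function chooseStep(start: Date, end: Date, maxPoints: number = 1000): string {
  const rangeSeconds = Math.max(0, (end.getTime() - start.getTime()) / 1000);
  for (const step of CANDIDATE_STEPS_SECONDS) {
    if (rangeSeconds / step <= maxPoints) return `${step}s`;
  }
  // Very long ranges fall back to the coarsest step.
  return `${CANDIDATE_STEPS_SECONDS[CANDIDATE_STEPS_SECONDS.length - 1]}s`;
}
```

With these assumptions, a two-week range resolves to a one-hour step, and a one-hour range resolves to 15 seconds.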

🚨 AlertManager Integration

  • Real-time Alert Monitoring: Query active, pending, and resolved alerts
  • Smart Alert Filtering: Filter by service, severity, alertname, or custom labels
  • Alert Fingerprinting: Track unique alert instances and their lifecycle
  • Incident Correlation: Group related alerts and calculate total impact

📊 Grafana Integration (Optional)

  • Dashboard Discovery: Find dashboards related to specific services
  • Dynamic Dashboard Links: Generate direct links to relevant monitoring views
  • Service Context Mapping: Connect services to their monitoring dashboards

🛠️ Available Tools

Natural Language Query

// Ask monitoring questions in plain English
mcp_monitoring_natural_language_query({
  question: "how many times did jenkins fail in the last week?",
  timeRange: "last week"  // optional
})

Active Alerts

// Get currently firing alerts
mcp_monitoring_get_active_alerts({
  filter: "alertname=cleanup-zuultmp"  // optional filter
})

Prometheus Instant Query

// Execute PromQL queries
mcp_monitoring_query_prometheus({
  query: "up{job='prometheus'}",
  time: "2024-01-15T10:30:00Z"  // optional timestamp
})

Prometheus Range Query

// Get historical time series data
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{severity='critical'}",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "1h"  // optional resolution
})
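
For orientation, a range query like the one above corresponds to a request against Prometheus's `/api/v1/query_range` HTTP endpoint, which takes `query`, `start`, `end`, and `step` parameters. The sketch below only builds the URL; the function name `buildRangeQueryUrl` and the default step of `1h` when none is supplied are assumptions, and the server may construct its requests differently.

```typescript
interface RangeQueryArgs {
  query: string;
  start: string; // RFC 3339 timestamp or Unix seconds
  end: string;
  step?: string; // Prometheus duration such as "1h"
}

// Build the URL for Prometheus's GET /api/v1/query_range endpoint.
function buildRangeQueryUrl(baseUrl: string, args: RangeQueryArgs): string {
  const params: Array<[string, string]> = [
    ["query", args.query],
    ["start", args.start],
    ["end", args.end],
    ["step", args.step ?? "1h"], // assumption: default step when omitted
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  // Drop trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, "")}/api/v1/query_range?${queryString}`;
}
```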

🚀 Quick Start

Installation

git clone <repository-url>
cd mcp-monitoring
npm install
npm run build

Configuration

Set environment variables:

export PROMETHEUS_URL="https://prometheus.example.com"
export ALERTMANAGER_URL="https://alertmanager.example.com"
export GRAFANA_URL="https://grafana.example.com"          # Optional
export GRAFANA_API_TOKEN="your-grafana-token"             # Optional - Ask admin to create service user and provide token
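
A startup check along these lines can catch missing settings early. This is a minimal sketch, assuming the variable names listed above; the `MonitoringConfig` shape and the `loadConfig` function are hypothetical and not taken from the server's source.

```typescript
interface MonitoringConfig {
  prometheusUrl: string;
  alertmanagerUrl: string;
  grafana?: { url: string; apiToken: string }; // present only when both values are set
}

type Env = Record<string, string | undefined>;

// Validate the environment and return a typed configuration object.
function loadConfig(env: Env): MonitoringConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable ${name}`);
    return value.replace(/\/+$/, ""); // normalise trailing slashes
  };

  const config: MonitoringConfig = {
    prometheusUrl: required("PROMETHEUS_URL"),
    alertmanagerUrl: required("ALERTMANAGER_URL"),
  };

  // Grafana is optional: enable it only when URL and token are both provided.
  const grafanaUrl = env["GRAFANA_URL"];
  const grafanaToken = env["GRAFANA_API_TOKEN"];
  if (grafanaUrl && grafanaToken) {
    config.grafana = { url: grafanaUrl.replace(/\/+$/, ""), apiToken: grafanaToken };
  }
  return config;
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`.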

Running the Server

npm start
# or
node dist/index.js

💬 Natural Language Examples

Service Failure Analysis

Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures, totaling 2 days and 3 hours of downtime

Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions

Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline

Service Availability Queries

Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis

Q: "Show cleanup-zuultmp disk usage alerts"  
A: Disk space warnings and critical alerts breakdown

Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report
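
Answers such as "N failures, totaling X of downtime" can be derived from a range query over the `ALERTS` series, which has a sample only while an alert is active. The sketch below counts a new incident whenever consecutive samples are more than one step apart. The function name `summarizeIncidents` and the rule of counting each sample as one step of firing time are assumptions, so the duration is an estimate at step resolution rather than the server's exact calculation.

```typescript
// One sample from a Prometheus range query: [unix seconds, value as string].
type Sample = [number, string];

interface IncidentSummary {
  incidents: number;     // number of separate firing periods
  firingSeconds: number; // estimated total firing time
}

// Split samples into contiguous runs; a gap larger than the step starts a new incident.
// Samples are expected in ascending timestamp order, as Prometheus returns them.
function summarizeIncidents(samples: Sample[], stepSeconds: number): IncidentSummary {
  let incidents = 0;
  let firingSeconds = 0;
  let previous: number | null = null;

  for (const [timestamp] of samples) {
    if (previous === null || timestamp - previous > stepSeconds) {
      incidents += 1;               // first sample of a new firing period
      firingSeconds += stepSeconds; // count that sample as one step
    } else {
      firingSeconds += timestamp - previous;
    }
    previous = timestamp;
  }
  return { incidents, firingSeconds };
}
```

A smaller step gives a more precise duration and catches shorter incidents, at the cost of a larger response.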

🔧 Integration Examples

VS Code MCP Configuration

{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_KEY": "${input:grafana_api_key}"
      }
    }
  }
}

For the Grafana token, ask an admin to create a service user and provide the token.

🎯 Use Cases

DevOps Teams

  • Incident Response: Quickly assess service health and failure patterns
  • Postmortem Analysis: Historical incident data for root cause analysis
  • Capacity Planning: Trend analysis and resource utilization monitoring
  • Alert Fatigue Management: Identify noisy alerts and optimization opportunities

SRE Teams

  • SLI/SLO Monitoring: Service availability and performance tracking
  • Error Budget Analysis: Calculate error rates and availability metrics
  • Automated Reporting: Generate incident reports and availability summaries
  • Proactive Monitoring: Identify patterns before they become critical issues

Development Teams

  • Deployment Monitoring: Track deployment success/failure rates
  • Performance Regression Detection: Compare metrics across releases
  • Integration Testing: Monitor test environment stability
  • Feature Flag Impact: Assess performance impact of feature rollouts

🧩 Architecture

Smart Query Processing Pipeline

  1. Intent Recognition: Parse natural language to understand query type
  2. Service Detection: Identify target services and components
  3. Time Range Extraction: Parse temporal expressions into date ranges
  4. PromQL Generation: Create optimized queries based on intent
  5. Data Analysis: Process results and calculate meaningful metrics
  6. Response Formatting: Present data in human-readable format

Supported Query Types

  • current_alerts: Active/firing alerts right now
  • historical_alerts: Past incidents and failure counts
  • service_availability: Uptime/downtime analysis
  • dashboard_discovery: Find relevant monitoring dashboards
  • metrics: General metric queries and analysis
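
To make the first pipeline stage more concrete, here is a minimal keyword-based classifier that maps a question to one of the query types listed above. The rules, their ordering, and the name `classifyIntent` are assumptions chosen for this sketch; the server's real intent recognition is likely more elaborate.

```typescript
type QueryType =
  | "current_alerts"
  | "historical_alerts"
  | "service_availability"
  | "dashboard_discovery"
  | "metrics";

// Rules are checked in order; the first matching pattern wins.
// Assumption: these keyword lists are illustrative only.
const INTENT_RULES: Array<{ type: QueryType; pattern: RegExp }> = [
  { type: "dashboard_discovery", pattern: /\bdashboards?\b/ },
  { type: "service_availability", pattern: /\b(down|downtime|uptime|available|availability)\b/ },
  {
    type: "historical_alerts",
    pattern: /\b(how many|count|fail(ed|ures?)?|outages?|incidents?|last|past|yesterday)\b/,
  },
  { type: "current_alerts", pattern: /\b(alerts?|firing|active)\b/ },
];

function classifyIntent(question: string): QueryType {
  const text = question.toLowerCase();
  for (const rule of INTENT_RULES) {
    if (rule.pattern.test(text)) return rule.type;
  }
  return "metrics"; // fallback: treat the question as a general metric query
}
```

Ordering matters in a rule list like this: "How many times was prometheus down last week?" mentions both a count and downtime, and placing the availability rule first routes it to `service_availability`.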

📈 Performance Features

  • Intelligent Query Optimization: Automatic step sizing for different time ranges
  • Result Caching: Avoid redundant API calls for recent queries
  • Timeout Handling: Graceful handling of slow monitoring APIs
  • Batch Processing: Efficient handling of multi-service queries
  • Memory Management: Optimized for long-running server deployment
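
Result caching for recent queries can be as simple as a time-to-live map keyed by the query string. The class below is a sketch under that assumption; `TtlCache`, its constructor arguments, and the injectable clock are hypothetical and not the server's actual cache.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

// A small time-to-live cache: entries expire ttlMs after they are stored.
class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // evict stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A short TTL suits instant queries, whose results go stale quickly, while results for ranges that lie entirely in the past could reasonably be kept longer.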

🔒 Security & Best Practices

Authentication

  • Secure API token storage for Grafana integration
  • Support for basic auth with Prometheus/AlertManager
  • Environment variable configuration for sensitive data

Network Security

  • HTTPS-only connections to monitoring services
  • Configurable timeout and retry policies
  • Certificate validation for secure connections

Access Control

  • Read-only operations by design
  • No data modification capabilities
  • Audit logging for all monitoring queries

🐛 Troubleshooting

Common Issues

# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity

# Authentication failures  
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials

# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range

# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity

Debug Mode

# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js

# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"

🚀 Advanced Usage

Custom Service Detection

The server automatically recognizes these services:

  • cleanup-zuultmp, opengrok, jenkins
  • grafana, prometheus, alertmanager
  • gerrit, nginx, mysql, redis, elasticsearch
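
Matching these names against a question could look like the sketch below, which treats any non-alphanumeric character as a word boundary so that "prevent-opengrok" still yields `opengrok` while "mysqldump" does not yield `mysql`. The function name `detectServices` and the boundary rule are assumptions for illustration.

```typescript
const KNOWN_SERVICES = [
  "cleanup-zuultmp", "opengrok", "jenkins",
  "grafana", "prometheus", "alertmanager",
  "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

// Return the known service names mentioned in the question, in list order.
function detectServices(question: string, known: string[] = KNOWN_SERVICES): string[] {
  const text = question.toLowerCase();
  return known.filter((name) => {
    // Escape regex metacharacters in the service name.
    const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Require a non-alphanumeric character (or string edge) on both sides.
    return new RegExp(`(^|[^a-z0-9])${escaped}([^a-z0-9]|$)`).test(text);
  });
}
```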

Advanced Natural Language Patterns

"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"

🤝 Contributing

Contributions welcome! Please ensure:

  • TypeScript compilation passes (npm run build)
  • Natural language query tests pass
  • Documentation is updated for new features
  • Error handling is comprehensive

Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions