mcp-monitoring

reemshai10/mcp-monitoring



📊 Monitoring MCP Server

A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.

🌟 Overview

This MCP server transforms how you interact with monitoring infrastructure by providing:

  • Natural Language Processing: Ask monitoring questions in plain English
  • Intelligent Query Translation: Automatically converts questions to PromQL queries
  • Historical Alert Analysis: Count failures, outages, and incidents over time
  • Multi-Source Integration: Seamlessly works with Prometheus, AlertManager, and Grafana
  • Automated Incident Detection: Smart pattern recognition for service failures

✨ Key Features

🧠 Natural Language Query Engine

  • Smart Intent Recognition: Understands monitoring questions like "How many times did service X fail?"
  • Automatic Time Range Parsing: Handles phrases like "last 2 weeks", "yesterday", "past month"
  • Service Name Detection: Recognizes services like opengrok, jenkins, grafana, prometheus
  • Alert Pattern Matching: Identifies automation failures, service outages, and critical incidents
  • Context-Aware Responses: Provides detailed breakdowns with incident counts and durations
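As a rough illustration of the time range parsing described above, the sketch below turns phrases such as "last 2 weeks" or "yesterday" into a start/end pair. The `parseTimeRange` helper, the `TimeRange` type, and the 30-day month are assumptions made for this example; the server's actual parser may work differently.

```typescript
// Hypothetical helper for illustration; not the server's actual parser.
interface TimeRange {
  start: Date;
  end: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const UNIT_MS: Record<string, number> = {
  hour: HOUR_MS,
  day: 24 * HOUR_MS,
  week: 7 * 24 * HOUR_MS,
  month: 30 * 24 * HOUR_MS, // assumption: a "month" is treated as 30 days
};

function parseTimeRange(phrase: string, now: Date = new Date()): TimeRange | null {
  const text = phrase.toLowerCase();

  // "yesterday" is interpreted as the previous calendar day in UTC.
  if (/\byesterday\b/.test(text)) {
    const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
    return { start: new Date(end.getTime() - UNIT_MS.day), end };
  }

  // "last 2 weeks", "past month", "last 7 days": optional count, then a unit.
  const match = /\b(?:last|past)\s+(?:(\d+)\s+)?(hour|day|week|month)s?\b/.exec(text);
  if (!match) return null;

  const count = match[1] ? parseInt(match[1], 10) : 1;
  const span = count * UNIT_MS[match[2]];
  return { start: new Date(now.getTime() - span), end: now };
}
```

Passing `now` explicitly keeps the function deterministic, which makes phrases like "yesterday" easy to test.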

🔍 Prometheus Integration

  • Advanced PromQL Generation: Automatically creates complex queries based on natural language
  • Historical Data Analysis: Analyzes alert trends and service availability over time
  • Metric Discovery: Browse and search available metrics with intelligent filtering
  • Range Query Optimization: Smart step sizing for different time ranges
  • Alert History Tracking: Tracks firing periods and incident detection
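One plausible way to implement step sizing for range queries is to pick the smallest step from a fixed list that keeps the number of points per series under a budget. The `chooseStep` function, the step list, and the 500-point budget below are assumptions for illustration, not the server's documented behavior. Prometheus itself rejects range queries that would return more than 11,000 points per series, so some cap is needed for long ranges.

```typescript
// Candidate steps in seconds, from fine to coarse (illustrative values).
const NICE_STEPS_SECONDS = [15, 30, 60, 300, 900, 3600, 21600, 86400];

// Returns a Prometheus duration string such as "3600s".
function chooseStep(start: Date, end: Date, maxPoints: number = 500): string {
  const rangeSeconds = Math.max(0, (end.getTime() - start.getTime()) / 1000);
  for (const step of NICE_STEPS_SECONDS) {
    // Accept the first (finest) step that stays within the point budget.
    if (rangeSeconds / step <= maxPoints) return `${step}s`;
  }
  // Very long ranges fall back to the coarsest step.
  return `${NICE_STEPS_SECONDS[NICE_STEPS_SECONDS.length - 1]}s`;
}
```

With these values, a one-hour range resolves to a 15-second step and a two-week range to a one-hour step.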

🚨 AlertManager Integration

  • Real-time Alert Monitoring: Query active, pending, and resolved alerts
  • Smart Alert Filtering: Filter by service, severity, alertname, or custom labels
  • Alert Fingerprinting: Track unique alert instances and their lifecycle
  • Incident Correlation: Group related alerts and calculate total impact
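Incident correlation can be approximated by grouping alerts that share a label value. The sketch below assumes alert objects carrying `fingerprint` and `labels` fields, as in the Alertmanager v2 API; the `groupAlertsByLabel` helper and the choice of grouping key are hypothetical.

```typescript
// Minimal alert shape for this example (a subset of what Alertmanager returns).
interface CorrelatedAlert {
  fingerprint: string;
  labels: Record<string, string>;
}

// Groups alerts by the value of one label; alerts without it go under "(none)".
function groupAlertsByLabel(
  alerts: CorrelatedAlert[],
  label: string
): Map<string, CorrelatedAlert[]> {
  const groups = new Map<string, CorrelatedAlert[]>();
  for (const alert of alerts) {
    const key = alert.labels[label] || "(none)";
    const bucket = groups.get(key) || [];
    bucket.push(alert);
    groups.set(key, bucket);
  }
  return groups;
}
```

Grouping by a `service` label, for example, gives one bucket per service whose size is the number of related alerts.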

📊 Grafana Integration (Optional)

  • Dashboard Discovery: Find dashboards related to specific services
  • Dynamic Dashboard Links: Generate direct links to relevant monitoring views
  • Service Context Mapping: Connect services to their monitoring dashboards

🛠️ Available Tools

Natural Language Query

// Ask monitoring questions in plain English
mcp_monitoring_natural_language_query({
  question: "how many times did jenkins fail in the last week?",
  timeRange: "last week"  // optional
})

Active Alerts

// Get currently firing alerts
mcp_monitoring_get_active_alerts({
  filter: "alertname=cleanup-zuultmp"  // optional filter
})
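The `filter` string above looks like a label equality matcher. The sketch below shows one way such a string could be parsed and applied to alert labels. It assumes comma-separated `name=value` pairs with optional quotes; the syntax the server actually accepts may be broader, and `parseFilter` and `matchesAll` are hypothetical names.

```typescript
interface LabelMatcher {
  name: string;
  value: string;
}

// Parses "alertname=foo, severity=\"critical\"" into equality matchers.
function parseFilter(filter: string): LabelMatcher[] {
  return filter
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const idx = part.indexOf("=");
      if (idx <= 0) throw new Error(`Invalid filter expression: "${part}"`);
      const rawValue = part.slice(idx + 1).trim();
      return {
        name: part.slice(0, idx).trim(),
        value: rawValue.replace(/^["']|["']$/g, ""), // strip surrounding quotes
      };
    });
}

// True when every matcher equals the corresponding label value.
function matchesAll(labels: Record<string, string>, matchers: LabelMatcher[]): boolean {
  return matchers.every((m) => labels[m.name] === m.value);
}
```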

Prometheus Instant Query

// Execute PromQL queries
mcp_monitoring_query_prometheus({
  query: "up{job='prometheus'}",
  time: "2024-01-15T10:30:00Z"  // optional timestamp
})

Prometheus Range Query

// Get historical time series data
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{severity='critical'}",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "1h"  // optional resolution
})
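The two Prometheus tools above map naturally onto the Prometheus HTTP API endpoints `/api/v1/query` and `/api/v1/query_range`. The sketch below shows how the request URLs could be built from the tool arguments. The function names are hypothetical, and whether the server builds its requests this way is an assumption.

```typescript
// Encodes defined parameters as a URL query string, skipping undefined values.
function buildQueryString(params: Record<string, string | undefined>): string {
  return Object.keys(params)
    .filter((key) => params[key] !== undefined)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key] as string)}`)
    .join("&");
}

// Removes trailing slashes so paths can be appended safely.
function trimBaseUrl(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "");
}

// Instant query: evaluates `query` at `time` (or at the current time if omitted).
function instantQueryUrl(baseUrl: string, query: string, time?: string): string {
  return `${trimBaseUrl(baseUrl)}/api/v1/query?${buildQueryString({ query, time })}`;
}

// Range query: evaluates `query` from `start` to `end` at `step` resolution.
function rangeQueryUrl(
  baseUrl: string,
  query: string,
  start: string,
  end: string,
  step: string = "1h"
): string {
  const qs = buildQueryString({ query, start, end, step });
  return `${trimBaseUrl(baseUrl)}/api/v1/query_range?${qs}`;
}
```

Percent-encoding the query matters here, because PromQL selectors contain characters such as `{`, `}`, and `=`.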

🚀 Quick Start

Installation

git clone <repository-url>
cd mcp-monitoring
npm install
npm run build

Configuration

Set environment variables:

export PROMETHEUS_URL="https://prometheus.example.com"
export ALERTMANAGER_URL="https://alertmanager.example.com"
export GRAFANA_URL="https://grafana.example.com"          # Optional
export GRAFANA_API_TOKEN="your-grafana-token"             # Optional - Ask admin to create service user and provide token
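A startup check can catch missing variables early. The sketch below reads the variables listed above from an environment object and fails fast when a required one is absent. It assumes `PROMETHEUS_URL` and `ALERTMANAGER_URL` are required because only the Grafana variables are marked optional; `loadConfig` and `MonitoringConfig` are hypothetical names.

```typescript
interface MonitoringConfig {
  prometheusUrl: string;
  alertmanagerUrl: string;
  grafanaUrl?: string;
  grafanaApiToken?: string;
}

type Env = Record<string, string | undefined>;

// Builds a typed config from environment variables, throwing on missing required ones.
function loadConfig(env: Env): MonitoringConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value;
  };

  return {
    prometheusUrl: required("PROMETHEUS_URL"),
    alertmanagerUrl: required("ALERTMANAGER_URL"),
    grafanaUrl: env["GRAFANA_URL"], // optional
    grafanaApiToken: env["GRAFANA_API_TOKEN"], // optional
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`.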

Running the Server

npm start
# or
node dist/index.js

💬 Natural Language Examples

Service Failure Analysis

Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures, totaling 2 days and 3 hours of downtime

Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions

Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline
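Answers like the failure count and total downtime above can be derived from a range query over the `ALERTS` series, which only has samples while an alert is pending or firing. The sketch below counts incidents as runs of contiguous samples and estimates their total duration. The gap threshold of 1.5 steps and the `summarizeFiring` helper are assumptions for illustration; the server's real accounting may differ.

```typescript
// One sample from a Prometheus matrix result: [unix seconds, value as string].
type Sample = [number, string];

interface FiringSummary {
  incidents: number;
  totalSeconds: number;
}

// Counts runs of contiguous samples and estimates how long the alert was firing.
function summarizeFiring(samples: Sample[], stepSeconds: number): FiringSummary {
  let incidents = 0;
  let totalSeconds = 0;
  let previous: number | null = null;

  for (const [timestamp] of samples) {
    if (previous !== null && timestamp - previous <= stepSeconds * 1.5) {
      // Same incident: extend it by the time since the previous sample.
      totalSeconds += timestamp - previous;
    } else {
      // First sample, or a gap larger than the threshold: a new incident starts.
      incidents += 1;
      totalSeconds += stepSeconds; // each sample is taken to cover one step
    }
    previous = timestamp;
  }
  return { incidents, totalSeconds };
}
```

The duration is only as precise as the query step, so a coarse step over a long range yields an estimate, not an exact figure.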

Service Availability Queries

Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis

Q: "Show cleanup-zuultmp disk usage alerts"  
A: Disk space warnings and critical alerts breakdown

Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report

🔧 Integration Examples

VS Code MCP Configuration

{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_TOKEN": "${input:grafana_api_key}"
      }
    }
  }
}

To obtain the Grafana token, ask your Grafana admin to create a service user and provide its token.

🎯 Use Cases

DevOps Teams

  • Incident Response: Quickly assess service health and failure patterns
  • Postmortem Analysis: Historical incident data for root cause analysis
  • Capacity Planning: Trend analysis and resource utilization monitoring
  • Alert Fatigue Management: Identify noisy alerts and optimization opportunities

SRE Teams

  • SLI/SLO Monitoring: Service availability and performance tracking
  • Error Budget Analysis: Calculate error rates and availability metrics
  • Automated Reporting: Generate incident reports and availability summaries
  • Proactive Monitoring: Identify patterns before they become critical issues

Development Teams

  • Deployment Monitoring: Track deployment success/failure rates
  • Performance Regression Detection: Compare metrics across releases
  • Integration Testing: Monitor test environment stability
  • Feature Flag Impact: Assess performance impact of feature rollouts

🧩 Architecture

Smart Query Processing Pipeline

  1. Intent Recognition: Parse natural language to understand query type
  2. Service Detection: Identify target services and components
  3. Time Range Extraction: Parse temporal expressions into date ranges
  4. PromQL Generation: Create optimized queries based on intent
  5. Data Analysis: Process results and calculate meaningful metrics
  6. Response Formatting: Present data in human-readable format

Supported Query Types

  • current_alerts: Active/firing alerts right now
  • historical_alerts: Past incidents and failure counts
  • service_availability: Uptime/downtime analysis
  • dashboard_discovery: Find relevant monitoring dashboards
  • metrics: General metric queries and analysis
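The query types above can be modeled as a union type with a small classifier in front of it. The keyword rules below are hypothetical and exist only to show the shape of intent recognition; the server's actual rules are not documented here.

```typescript
type QueryType =
  | "current_alerts"
  | "historical_alerts"
  | "service_availability"
  | "dashboard_discovery"
  | "metrics";

// Hypothetical keyword rules, checked in order; the first match wins.
const INTENT_RULES: Array<{ type: QueryType; pattern: RegExp }> = [
  { type: "dashboard_discovery", pattern: /\bdashboards?\b/ },
  { type: "service_availability", pattern: /\b(down|downtime|uptime|availability)\b/ },
  { type: "historical_alerts", pattern: /\b(how many|count|fail(ed|ures?)?|outages?|incidents?)\b/ },
  { type: "current_alerts", pattern: /\b(active|firing|currently|right now)\b/ },
];

// Maps a question to a query type, defaulting to a general metrics query.
function classifyIntent(question: string): QueryType {
  const text = question.toLowerCase();
  for (const rule of INTENT_RULES) {
    if (rule.pattern.test(text)) return rule.type;
  }
  return "metrics";
}
```

Rule order matters in this design: "How many times was prometheus down last week?" contains both "how many" and "down", and the availability rule is listed first so that it wins.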

📈 Performance Features

  • Intelligent Query Optimization: Automatic step sizing for different time ranges
  • Result Caching: Avoid redundant API calls for recent queries
  • Timeout Handling: Graceful handling of slow monitoring APIs
  • Batch Processing: Efficient handling of multi-service queries
  • Memory Management: Optimized for long-running server deployment
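Result caching for recent queries is commonly done with a small time-to-live cache keyed by the query parameters. The `TtlCache` class below is a minimal sketch of that idea under assumed names and an injectable clock; it is not the server's actual cache.

```typescript
// A minimal in-memory cache whose entries expire after a fixed time-to-live.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so that expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop expired entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A key built from the query text, start, end, and step would let identical range queries within the TTL reuse one upstream response.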

🔒 Security & Best Practices

Authentication

  • Secure API token storage for Grafana integration
  • Support for basic auth with Prometheus/AlertManager
  • Environment variable configuration for sensitive data

Network Security

  • HTTPS-only connections to monitoring services
  • Configurable timeout and retry policies
  • Certificate validation for secure connections

Access Control

  • Read-only operations by design
  • No data modification capabilities
  • Audit logging for all monitoring queries

🐛 Troubleshooting

Common Issues

# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity

# Authentication failures  
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials

# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range

# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity

Debug Mode

# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js

# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"

🚀 Advanced Usage

Custom Service Detection

The server automatically recognizes these services:

  • cleanup-zuultmp, opengrok, jenkins
  • grafana, prometheus, alertmanager
  • gerrit, nginx, mysql, redis, elasticsearch
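Detection of these service names can be sketched as a lookup against a fixed list. The `detectServices` helper and its matching rule (the name must not be embedded in a longer alphanumeric word) are assumptions for illustration.

```typescript
// Service names taken from the list above.
const KNOWN_SERVICES = [
  "cleanup-zuultmp", "opengrok", "jenkins",
  "grafana", "prometheus", "alertmanager",
  "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

// Escapes regex metacharacters so a name can be embedded in a pattern.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the known services mentioned in a question, in list order.
function detectServices(question: string): string[] {
  const text = question.toLowerCase();
  return KNOWN_SERVICES.filter((service) => {
    // Require a non-alphanumeric character (or string edge) on both sides.
    const pattern = new RegExp(`(^|[^a-z0-9])${escapeRegExp(service)}([^a-z0-9]|$)`);
    return pattern.test(text);
  });
}
```

Because a hyphen counts as a boundary, "prevent-opengrok" is detected as `opengrok`, while a string like "myjenkinsbox" is not matched.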

Advanced Natural Language Patterns

"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"
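The first pattern above can be expressed as a regular expression with one capture group per placeholder. The sketch below is a hypothetical illustration of template matching; the server may use a different mechanism entirely.

```typescript
interface FailureQuestion {
  service: string;
  period: string;
}

// Matches "How many times did [service] fail in the last [time period]?"
const FAILURE_PATTERN = /^how many times did (.+?) fail in the (?:last|past) (.+?)\??$/i;

// Extracts the service and time period, or returns null if the template does not fit.
function parseFailureQuestion(question: string): FailureQuestion | null {
  const match = FAILURE_PATTERN.exec(question.trim());
  return match ? { service: match[1], period: match[2] } : null;
}
```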

🤝 Contributing

Contributions welcome! Please ensure:

  • TypeScript compilation passes (npm run build)
  • Natural language query tests pass
  • Documentation is updated for new features
  • Error handling is comprehensive

Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions