📊 Monitoring MCP Server
A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.
🌟 Overview
This MCP server transforms how you interact with monitoring infrastructure by providing:
- Natural Language Processing: Ask monitoring questions in plain English
- Intelligent Query Translation: Automatically converts questions to PromQL queries
- Historical Alert Analysis: Count failures, outages, and incidents over time
- Multi-Source Integration: Seamlessly works with Prometheus, AlertManager, and Grafana
- Automated Incident Detection: Smart pattern recognition for service failures
✨ Key Features
🧠 Natural Language Query Engine
- Smart Intent Recognition: Understands monitoring questions like "How many times did service X fail?"
- Automatic Time Range Parsing: Handles phrases like "last 2 weeks", "yesterday", "past month" (see the sketch below)
- Service Name Detection: Recognizes services like opengrok, jenkins, grafana, prometheus
- Alert Pattern Matching: Identifies automation failures, service outages, and critical incidents
- Context-Aware Responses: Provides detailed breakdowns with incident counts and durations
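A minimal sketch of how such time-range parsing could work, assuming phrases are matched against a small set of patterns (illustrative only; parseTimeRange is a hypothetical helper, not the server's actual parser):
// Illustrative: turn "last 2 weeks", "past month", or "yesterday" into a start/end window.
function parseTimeRange(phrase: string, now: Date = new Date()): { start: Date; end: Date } {
  const DAY = 86_400_000;
  const units: Record<string, number> = { hour: DAY / 24, day: DAY, week: 7 * DAY, month: 30 * DAY }; // month approximated as 30 days
  const m = phrase.toLowerCase().match(/(?:last|past)\s+(\d+)?\s*(hour|day|week|month)s?/);
  if (m) {
    const count = m[1] ? parseInt(m[1], 10) : 1;
    return { start: new Date(now.getTime() - count * units[m[2]]), end: now };
  }
  if (phrase.toLowerCase().includes("yesterday")) {
    return { start: new Date(now.getTime() - 2 * DAY), end: new Date(now.getTime() - DAY) };
  }
  return { start: new Date(now.getTime() - DAY), end: now }; // default: last 24 hours
}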
🔍 Prometheus Integration
- Advanced PromQL Generation: Automatically creates complex queries based on natural language (see the example below)
- Historical Data Analysis: Analyzes alert trends and service availability over time
- Metric Discovery: Browse and search available metrics with intelligent filtering
- Range Query Optimization: Smart step sizing for different time ranges
- Alert History Tracking: Tracks firing periods and incident detection
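For instance, a question about jenkins failures over the past week could translate into a range query over Prometheus' built-in ALERTS series, issued through the range-query tool listed under Available Tools (illustrative; the exact query the engine generates may differ):
// Illustrative translation of "how many times did jenkins fail in the last week?"
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{alertname=~'.*jenkins.*', alertstate='firing'}",
  start: "2024-01-08T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "5m"
})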
🚨 AlertManager Integration
- Real-time Alert Monitoring: Query active, pending, and resolved alerts
- Smart Alert Filtering: Filter by service, severity, alertname, or custom labels (see the sketch below)
- Alert Fingerprinting: Track unique alert instances and their lifecycle
- Incident Correlation: Group related alerts and calculate total impact
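This kind of filtering maps naturally onto AlertManager's standard v2 API; a rough sketch of such a call (getActiveAlerts is a hypothetical helper, not the server's actual code):
// Illustrative: fetch active alerts from AlertManager, optionally narrowed by a label matcher.
async function getActiveAlerts(filter?: string) {
  const url = new URL("/api/v2/alerts", process.env.ALERTMANAGER_URL);
  url.searchParams.set("active", "true");
  if (filter) url.searchParams.append("filter", filter); // e.g. alertname="cleanup-zuultmp"
  const res = await fetch(url);
  if (!res.ok) throw new Error(`AlertManager returned ${res.status}`);
  return res.json(); // alerts carry labels, fingerprint, startsAt/endsAt
}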
📊 Grafana Integration (Optional)
- Dashboard Discovery: Find dashboards related to specific services (see the sketch below)
- Dynamic Dashboard Links: Generate direct links to relevant monitoring views
- Service Context Mapping: Connect services to their monitoring dashboards
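A minimal sketch of dashboard discovery against Grafana's search API, assuming the GRAFANA_URL and GRAFANA_API_TOKEN settings from the Configuration section (findDashboards is a hypothetical helper):
// Illustrative: look up dashboards whose title matches a service name.
async function findDashboards(service: string) {
  const url = new URL("/api/search", process.env.GRAFANA_URL);
  url.searchParams.set("query", service);
  url.searchParams.set("type", "dash-db");
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}` },
  });
  return res.json(); // each hit includes a title and url for building direct links
}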
🛠️ Available Tools
Natural Language Query
// Ask monitoring questions in plain English
mcp_monitoring_natural_language_query({
  question: "how many times did jenkins fail in the last week?",
  timeRange: "last week" // optional
})
Active Alerts
// Get currently firing alerts
mcp_monitoring_get_active_alerts({
  filter: "alertname=cleanup-zuultmp" // optional filter
})
Prometheus Instant Query
// Execute PromQL queries
mcp_monitoring_query_prometheus({
  query: "up{job='prometheus'}",
  time: "2024-01-15T10:30:00Z" // optional timestamp
})
Prometheus Range Query
// Get historical time series data
mcp_monitoring_query_prometheus_range({
  query: "ALERTS{severity='critical'}",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-15T00:00:00Z",
  step: "1h" // optional resolution
})
🚀 Quick Start
Installation
git clone <repository-url>
cd monitoring-mcp
npm install
npm run build
Configuration
Set environment variables:
export PROMETHEUS_URL="https://prometheus.example.com"
export ALERTMANAGER_URL="https://alertmanager.example.com"
export GRAFANA_URL="https://grafana.example.com" # Optional
export GRAFANA_API_TOKEN="your-grafana-token" # Optional - Ask admin to create service user and provide token
Running the Server
npm start
# or
node dist/index.js
💬 Natural Language Examples
Service Failure Analysis
Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures, with 2 days and 3 hours of total downtime
Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions
Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline
Service Availability Queries
Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis
Q: "Show cleanup-zuultmp disk usage alerts"
A: Disk space warnings and critical alerts breakdown
Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report
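Each of these questions maps directly onto the natural-language tool shown earlier; for example:
mcp_monitoring_natural_language_query({
  question: "How many times was prometheus down last week?"
})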
🔧 Integration Examples
VS Code MCP Configuration
{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_TOKEN": "${input:grafana_api_key}"
      }
    }
  }
}
For the Grafana token, ask an admin to create a service user and provide its API token.
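The ${input:...} placeholders above are prompted for by VS Code and are typically declared in an inputs array in the same configuration file; a minimal example (the ids must match the placeholders; promptString is VS Code's standard input type):
{
  "inputs": [
    { "type": "promptString", "id": "prometheus_base_url", "description": "Prometheus base URL" },
    { "type": "promptString", "id": "alertmanager_base_url", "description": "AlertManager base URL" },
    { "type": "promptString", "id": "grafana_base_url", "description": "Grafana base URL" },
    { "type": "promptString", "id": "grafana_api_key", "description": "Grafana API token", "password": true }
  ]
}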
🎯 Use Cases
DevOps Teams
- Incident Response: Quickly assess service health and failure patterns
- Postmortem Analysis: Historical incident data for root cause analysis
- Capacity Planning: Trend analysis and resource utilization monitoring
- Alert Fatigue Management: Identify noisy alerts and optimization opportunities
SRE Teams
- SLI/SLO Monitoring: Service availability and performance tracking
- Error Budget Analysis: Calculate error rates and availability metrics
- Automated Reporting: Generate incident reports and availability summaries
- Proactive Monitoring: Identify patterns before they become critical issues
Development Teams
- Deployment Monitoring: Track deployment success/failure rates
- Performance Regression Detection: Compare metrics across releases
- Integration Testing: Monitor test environment stability
- Feature Flag Impact: Assess performance impact of feature rollouts
🧩 Architecture
Smart Query Processing Pipeline
- Intent Recognition: Parse natural language to understand query type
- Service Detection: Identify target services and components
- Time Range Extraction: Parse temporal expressions into date ranges
- PromQL Generation: Create optimized queries based on intent
- Data Analysis: Process results and calculate meaningful metrics
- Response Formatting: Present data in human-readable format
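A compact sketch of how these stages could fit together (the stage functions are placeholders standing in for the server's internals, not its actual API):
type Intent = "current_alerts" | "historical_alerts" | "service_availability" | "dashboard_discovery" | "metrics";
declare function recognizeIntent(question: string): Intent;                     // 1. Intent Recognition
declare function detectService(question: string): string | undefined;          // 2. Service Detection
declare function parseTimeRange(question: string): { start: Date; end: Date }; // 3. Time Range Extraction
declare function buildPromQL(intent: Intent, service?: string): string;        // 4. PromQL Generation
declare function queryRange(promql: string, range: { start: Date; end: Date }): Promise<unknown>;
declare function formatResponse(intent: Intent, data: unknown): string;        // 6. Response Formatting

async function answerQuestion(question: string): Promise<string> {
  const intent = recognizeIntent(question);
  const service = detectService(question);
  const range = parseTimeRange(question);
  const promql = buildPromQL(intent, service);
  const data = await queryRange(promql, range); // 5. Data Analysis runs on the fetched results
  return formatResponse(intent, data);
}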
Supported Query Types
- current_alerts: Active/firing alerts right now
- historical_alerts: Past incidents and failure counts
- service_availability: Uptime/downtime analysis
- dashboard_discovery: Find relevant monitoring dashboards
- metrics: General metric queries and analysis
📈 Performance Features
- Intelligent Query Optimization: Automatic step sizing for different time ranges (see the sketch below)
- Result Caching: Avoid redundant API calls for recent queries
- Timeout Handling: Graceful handling of slow monitoring APIs
- Batch Processing: Efficient handling of multi-service queries
- Memory Management: Optimized for long-running server deployment
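As an example of the step sizing mentioned above, the resolution can grow with the size of the requested window (chooseStep and its thresholds are illustrative, not the server's actual values):
// Illustrative: pick a Prometheus range-query step that keeps the point count manageable.
function chooseStep(start: Date, end: Date): string {
  const hours = (end.getTime() - start.getTime()) / 3_600_000;
  if (hours <= 6) return "1m";
  if (hours <= 48) return "5m";
  if (hours <= 24 * 14) return "1h";
  return "6h";
}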
🔒 Security & Best Practices
Authentication
- Secure API token storage for Grafana integration
- Support for basic auth with Prometheus/AlertManager
- Environment variable configuration for sensitive data
Network Security
- HTTPS-only connections to monitoring services
- Configurable timeout and retry policies
- Certificate validation for secure connections
Access Control
- Read-only operations by design
- No data modification capabilities
- Audit logging for all monitoring queries
🐛 Troubleshooting
Common Issues
# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity
# Authentication failures
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials
# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range
# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity
Debug Mode
# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js
# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"
🚀 Advanced Usage
Custom Service Detection
The server automatically recognizes these services:
cleanup-zuultmp, opengrok, jenkins, grafana, prometheus, alertmanager, gerrit, nginx, mysql, redis, elasticsearch
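Detection can be thought of as a simple scan of the question against this list (a sketch only; the server's matching may be more involved):
// Illustrative: return the first known service name mentioned in the question.
const KNOWN_SERVICES = [
  "cleanup-zuultmp", "opengrok", "jenkins", "grafana", "prometheus",
  "alertmanager", "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

function detectService(question: string): string | undefined {
  const lower = question.toLowerCase();
  return KNOWN_SERVICES.find((service) => lower.includes(service));
}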
Advanced Natural Language Patterns
"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"
🤝 Contributing
Contributions welcome! Please ensure:
- TypeScript compilation passes (npm run build)
- Natural language query tests pass
- Documentation updated for new features
- Error handling comprehensive
Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions