
๐Ÿ• Datadog MCP Server

Enhanced Model Context Protocol (MCP) server for the Datadog observability platform

A production-ready MCP server that gives AI agents intelligent access to Datadog monitoring data, metrics, logs, and advanced troubleshooting workflows over a streamable HTTP transport with Server-Sent Events (SSE) support.


✨ Features

🔧 Core Tools

  • get_metrics - Query timeseries metrics with flexible time ranges
  • list_metrics - Discover available metrics with filtering
  • get_logs - Advanced log search with pagination and time precision
  • get_next_datadog_logs_page - Cursor-based pagination for large log sets
  • get_monitors - Monitor status and management
  • list_dashboards - Dashboard discovery and listing

🤖 AI-Powered Prompts

  • datadog-metrics-analysis - Automated metrics analysis and insights
  • datadog-performance-diagnosis - Step-by-step performance troubleshooting workflow
  • datadog-incident-commander - Intelligent incident response coordination
  • datadog-time-range-advisor - Smart time range selection guidance

📊 Smart Resources

  • datadog://metrics/{query} - Real-time metrics with AI analysis
  • datadog://logs/{query} - Intelligent log search and analysis
  • datadog://logs-detailed/{query} - Enhanced log analysis with full context
  • datadog://health-check/{service_name} - Comprehensive service health assessment

⚡ Advanced Capabilities

  • Flexible Time Ranges: Support for minutes, hours, days, weeks, and months
  • Intelligent Parameter Handling: Smart defaults with comprehensive validation
  • AI Agent Optimization: Structured workflows for autonomous troubleshooting
  • Production Ready: Comprehensive error handling and logging

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • Datadog API Key and Application Key
  • Virtual environment (recommended)

Installation

  1. Clone and setup:
cd fastMCPserver
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
  2. Configure environment:
cp .env.example .env
# Edit .env with your Datadog credentials:
# DD_API_KEY=your_datadog_api_key
# DD_APP_KEY=your_datadog_app_key  
# DD_SITE=us3.datadoghq.com  # or your Datadog site
  3. Start the server:
python3 datadog_mcp_server.py

The server will start on http://0.0.0.0:8080/mcp/
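Once it is up, you can sanity-check the endpoint from the command line. This is a minimal sketch: it assumes the server is reachable on localhost:8080 and that the streamable HTTP transport accepts a plain JSON-RPC initialize request; depending on the FastMCP version, subsequent calls (e.g. tools/list) may also require the MCP session ID returned by this handshake.

# Smoke test: send an MCP initialize request to the endpoint
curl -s -X POST http://localhost:8080/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"curl-smoke-test","version":"0.0.1"}}}'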

🔧 Configuration

Environment Variables

Core Configuration
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| DD_API_KEY | Datadog API Key | - | ✅ |
| DD_APP_KEY | Datadog Application Key | - | ✅ |
| DD_SITE | Datadog site | datadoghq.com | ❌ |
| MCP_SERVER_HOST | Server host | 0.0.0.0 | ❌ |
| MCP_SERVER_PORT | Server port | 8080 | ❌ |
Debug Configuration
| Variable | Description | Default | Options |
|----------|-------------|---------|---------|
| MCP_DEBUG_LEVEL | Debug logging level | INFO | NONE, INFO, DEBUG, TRACE |
| MCP_DEBUG_REQUESTS | Log incoming MCP requests | false | true, false |
| MCP_DEBUG_RESPONSES | Log outgoing MCP responses | false | true, false |
| MCP_DEBUG_TIMING | Include execution timing | false | true, false |
| MCP_DEBUG_PARAMETERS | Log function parameters | false | true, false |
| MCP_DEBUG_PRETTY_PRINT | Pretty print JSON in logs | true | true, false |
| MCP_DEBUG_ERRORS | Enhanced error logging | true | true, false |
| MCP_DEBUG_MASK_SENSITIVE | Mask API keys in logs | true | true, false |
Debug Level Guide
  • NONE: Minimal logging, warnings and errors only
  • INFO: Basic operation logs and debug messages
  • DEBUG: Detailed function calls, API requests/responses
  • TRACE: Full request/response payloads, parameter details

Datadog Sites

  • US1: datadoghq.com (default)
  • US3: us3.datadoghq.com
  • US5: us5.datadoghq.com
  • EU1: datadoghq.eu
  • AP1: ap1.datadoghq.com
  • GOV: ddog-gov.com

📚 API Documentation

Tools

get_metrics

Query timeseries metrics data from Datadog.

Parameters:

  • query (string, required): Datadog metrics query (e.g., "avg:system.cpu.user{*}")
  • hours_back (integer, default: 1): Hours back from now to query
  • minutes_back (integer, optional): Minutes back from now (overrides hours_back)

Examples:

{
  "name": "get_metrics",
  "arguments": {
    "query": "avg:system.cpu.user{*}",
    "minutes_back": 30
  }
}
get_logs

Search Datadog logs with advanced filtering and pagination.

Parameters:

  • query (string, required): Log search query
  • limit (integer, default: 100): Number of logs per page (max 1000)
  • hours_back (integer, optional): Hours back from now to search
  • minutes_back (integer, optional): Minutes back from now (overrides hours_back)
  • from_time (string, optional): Start time in ISO format
  • to_time (string, optional): End time in ISO format
  • indexes (array, optional): List of log indexes to search
  • sort (string, default: "timestamp"): Sort order ("timestamp" for oldest first, "-timestamp" for newest first)
  • cursor (string, optional): Pagination cursor
  • max_total_logs (integer, optional): Maximum logs across all pages

Examples:

{
  "name": "get_logs", 
  "arguments": {
    "query": "service:api-gateway AND status:error",
    "minutes_back": 30,
    "limit": 50,
    "sort": "-timestamp"
  }
}
list_metrics

Get a list of available metrics with optional filtering.

Parameters:

  • filter_query (string, optional): Filter metrics by name pattern
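Example (the filter pattern below is illustrative):

{
  "name": "list_metrics",
  "arguments": {
    "filter_query": "system.cpu"
  }
}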
get_monitors

Get monitor data with optional state filtering.

Parameters:

  • group_states (array, optional): Filter by states (e.g., ["Alert", "Warn"])
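Example (illustrative; the group_states values follow the format shown above):

{
  "name": "get_monitors",
  "arguments": {
    "group_states": ["Alert", "Warn"]
  }
}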
list_dashboards

Get a list of available dashboards.

Parameters: None

get_next_datadog_logs_page

Get the next page of logs using cursor-based pagination.

Parameters:

  • cursor (string, required): Cursor from previous response
  • limit (integer, default: 100): Number of logs to retrieve
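Example (illustrative; the cursor value is a placeholder for the cursor string returned by a previous get_logs call):

{
  "name": "get_next_datadog_logs_page",
  "arguments": {
    "cursor": "<cursor-from-previous-response>",
    "limit": 100
  }
}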

Prompts

datadog-metrics-analysis

Automated metrics analysis with AI insights.

datadog-performance-diagnosis

Structured performance troubleshooting workflow for AI agents.

Parameters:

  • service_name (string): Name of the service to diagnose
  • symptoms (string): Observed performance symptoms
  • severity (string): Issue severity level
datadog-incident-commander

AI-powered incident command and coordination workflow.

Parameters:

  • severity (string): Incident severity (low, medium, high, critical)
  • affected_services (string): Comma-separated list of affected services
  • symptoms (string): Observed incident symptoms
  • estimated_user_impact (string): Estimated user impact percentage
datadog-time-range-advisor

Smart time range selection guidance for different analysis types.

Parameters:

  • analysis_type (string): Type of analysis (performance, security, deployment, capacity)
  • suspected_timeframe (string): When issue might have started
  • incident_impact (string): Impact level

Resources

datadog://metrics/{query}

Real-time metrics data with AI analysis and insights.

datadog://logs/{query}

Intelligent log search with formatted results and metadata.

datadog://logs-detailed/{query}

Enhanced log analysis with full context and detailed breakdown.
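This resource can be read the same way as the health-check and metrics resources shown in the Postman section below. A minimal sketch (the query segment is illustrative and may need URL-encoding depending on your client):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "resources/read",
  "params": {
    "uri": "datadog://logs-detailed/status:error"
  }
}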

datadog://health-check/{service_name}

Comprehensive service health assessment with:

  • Multi-dimensional health scoring
  • Performance metrics analysis
  • Error rate evaluation
  • AI-generated recommendations
  • Business impact translation

โฐ Time Range Examples

Quick Reference

⚡ Real-time: minutes_back=15, minutes_back=30
🕐 Recent: hours_back=1, hours_back=6, hours_back=24
📅 Weekly: hours_back=168 (7×24)
📊 Monthly: hours_back=720 (30×24)
📈 Quarterly: hours_back=2160 (90×24)

Common Scenarios

// Active incident (last 15 minutes)
{"query": "status:error", "minutes_back": 15}

// Deployment verification (last 2 hours)  
{"query": "deploy OR release", "hours_back": 2}

// Weekly performance trends
{"query": "slow OR timeout", "hours_back": 168}

// Monthly capacity planning
{"query": "cpu OR memory", "hours_back": 720}

🧪 Testing with Postman

The server exposes a plain HTTP JSON-RPC endpoint, which makes it easy to test with Postman. Here's how to get started:

1. Setup Postman Collection

Base URL: http://localhost:8080/mcp/
Method: POST
Headers: Content-Type: application/json

2. Test Tools

List Available Tools
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/list",
  "params": {}
}
Get Metrics (30 minutes)
{
  "jsonrpc": "2.0", 
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_metrics",
    "arguments": {
      "query": "avg:system.cpu.user{*}",
      "minutes_back": 30
    }
  }
}
Search Logs (Last Hour)
{
  "jsonrpc": "2.0",
  "id": 1, 
  "method": "tools/call",
  "params": {
    "name": "get_logs",
    "arguments": {
      "query": "status:error",
      "hours_back": 1,
      "limit": 50,
      "sort": "-timestamp"
    }
  }
}
Get Time Range Advice
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call", 
  "params": {
    "name": "datadog-time-range-advisor",
    "arguments": {
      "analysis_type": "performance",
      "suspected_timeframe": "recent",
      "incident_impact": "high"
    }
  }
}

3. Test Prompts

Performance Diagnosis
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "datadog-performance-diagnosis", 
    "arguments": {
      "service_name": "api-gateway",
      "symptoms": "High response time, increased error rate",
      "severity": "high"
    }
  }
}
Incident Command
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "datadog-incident-commander",
    "arguments": {
      "severity": "critical",
      "affected_services": "api-gateway, database",
      "symptoms": "Service timeout, 500 errors",
      "estimated_user_impact": "25%"
    }
  }
}

4. Test Resources

Health Check Resource
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "resources/read",
  "params": {
    "uri": "datadog://health-check/api-gateway"
  }
}
Metrics Resource
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "resources/read", 
  "params": {
    "uri": "datadog://metrics/avg:system.cpu.user{*}"
  }
}

5. Common Testing Scenarios

Error Investigation Workflow
// Step 1: Get recent errors
{"name": "get_logs", "arguments": {"query": "status:error", "minutes_back": 30}}

// Step 2: Check error rate metrics  
{"name": "get_metrics", "arguments": {"query": "sum:trace.http.request.errors{*}", "hours_back": 1}}

// Step 3: Get service health assessment
// Use resources/read with uri: "datadog://health-check/your-service-name"
Performance Analysis Workflow
// Step 1: Get time range advice
{"name": "datadog-time-range-advisor", "arguments": {"analysis_type": "performance"}}

// Step 2: Follow the recommended time ranges
{"name": "get_metrics", "arguments": {"query": "avg:trace.http.request.duration{*}", "minutes_back": 30}}

// Step 3: Get detailed performance diagnosis
{"name": "datadog-performance-diagnosis", "arguments": {"service_name": "api"}}

6. Parameter Tips

✅ Correct Parameter Types:

{
  "hours_back": 2,          // Number, not string
  "minutes_back": 30,       // Number, not string  
  "limit": 100,             // Number, not string
  "from_time": "2025-09-14T15:00:00Z",  // ISO string
  "indexes": ["main"],      // Array, not string
  "sort": "-timestamp"      // String
}

โŒ Common Mistakes to Avoid:

{
  "hours_back": "2",        // String instead of number
  "minutes_back": "",       // Empty string instead of null/number
  "from_time": "2h",        // Relative time instead of ISO
  "indexes": "main"         // String instead of array
}

๐Ÿ—๏ธ Architecture

Technology Stack

  • FastMCP 2.10.6 - High-performance MCP server framework
  • MCP Protocol 1.12.2 - Model Context Protocol compliance
  • Python 3.12+ - Modern Python with type hints
  • Datadog API Client - Official Datadog Python SDK
  • HTTP + SSE - Streamable transport with Server-Sent Events

Design Principles

  • AI-First Design - Optimized for AI agent interaction patterns
  • Production Ready - Comprehensive error handling, logging, and validation
  • Developer Experience - Clear APIs, helpful error messages, extensive examples
  • Performance Focused - Efficient data handling and smart caching strategies
  • Standards Compliant - Full MCP protocol compliance with FastMCP optimizations

๐Ÿ” Troubleshooting

Common Issues

Server Won't Start
# Check if port is in use
lsof -i :8080

# Check environment variables
echo $DD_API_KEY
echo $DD_APP_KEY

# Check logs for detailed errors
python3 datadog_mcp_server.py
API Authentication Errors
  1. Verify your Datadog API key and Application key (see the curl sketch after this list)
  2. Check your Datadog site setting
  3. Ensure keys have required permissions
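One way to check the credentials outside the MCP server is to call the Datadog API directly. A minimal sketch, assuming the default US1 site (substitute your DD_SITE domain, e.g. api.us3.datadoghq.com, if different):

# Validate the API key
curl -s -H "DD-API-KEY: ${DD_API_KEY}" "https://api.datadoghq.com/api/v1/validate"

# Exercise both keys with a lightweight read call (lists dashboards)
curl -s -H "DD-API-KEY: ${DD_API_KEY}" -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" "https://api.datadoghq.com/api/v1/dashboard"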
Postman Testing Issues
  1. Use Content-Type: application/json header
  2. Send parameters as numbers, not strings (e.g., "hours_back": 2)
  3. Use proper ISO format for time strings
  4. Check server logs for validation errors
Performance Issues
  1. Use smaller time ranges for faster responses
  2. Limit log search results with limit parameter
  3. Use minutes_back for precise short-term analysis
  4. Consider pagination for large datasets
Debug and Tracing

For troubleshooting MCP communication and API issues:

# Enable full debug tracing
export MCP_DEBUG_LEVEL=TRACE
export MCP_DEBUG_REQUESTS=true
export MCP_DEBUG_RESPONSES=true
export MCP_DEBUG_TIMING=true
export MCP_DEBUG_PARAMETERS=true

# Start server with debug enabled
python3 datadog_mcp_server.py

Debug Use Cases:

  • 400 Bad Request Errors: Enable MCP_DEBUG_REQUESTS=true to see exact request payloads
  • Empty Results: Use MCP_DEBUG_LEVEL=DEBUG to trace API calls and responses
  • Performance Issues: Enable MCP_DEBUG_TIMING=true to identify slow operations
  • Parameter Validation: Use MCP_DEBUG_PARAMETERS=true to debug argument parsing

Security Note: In production, keep MCP_DEBUG_MASK_SENSITIVE=true to prevent API keys from appearing in logs.

🚀 Production Deployment

Docker Deployment

# Build image
docker build -t datadog-mcp-server .

# Run container
docker run -p 8080:8080 \
  -e DD_API_KEY=your_api_key \
  -e DD_APP_KEY=your_app_key \
  -e DD_SITE=us3.datadoghq.com \
  datadog-mcp-server

Environment-Specific Configuration

  • Development: Use .env file with debug logging
  • Staging: Enable request/response logging for testing
  • Production: Use environment variables, disable debug logs (see the sketch below)
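As a starting point, a sketch of production-leaning settings (variable names match the tables above; adjust values for your environment):

# Production-style environment: debug logging off, sensitive values masked
export DD_API_KEY=your_api_key
export DD_APP_KEY=your_app_key
export DD_SITE=datadoghq.com
export MCP_DEBUG_LEVEL=NONE
export MCP_DEBUG_REQUESTS=false
export MCP_DEBUG_RESPONSES=false
export MCP_DEBUG_MASK_SENSITIVE=true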

Monitoring

  • Monitor server health at the /mcp/ endpoint (see the probe sketch after this list)
  • Track API rate limits and usage
  • Monitor response times and error rates
  • Set up alerts for authentication failures
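A simple external probe is one way to cover the first bullet. This sketch only checks that the endpoint answers over HTTP (it does not validate MCP responses), so tune the expected status codes to your deployment:

# Probe the MCP endpoint and report the HTTP status code
status=$(curl -s -o /dev/null -w "%{http_code}" -X POST http://localhost:8080/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"ping"}')
echo "datadog-mcp endpoint returned HTTP ${status}"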

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Update documentation
  5. Submit a pull request

📄 License

This project is licensed under the MIT License.

🎉 Ready to enhance your AI agents with powerful Datadog observability!

Start testing with Postman using the examples above, and explore the intelligent prompts and resources for advanced AI-powered troubleshooting workflows.