mcp-amdsmi by AMD-melliott - MCP Server

AMD SMI MCP Server

An intelligent Model Context Protocol (MCP) server that provides conversational access to AMD GPU monitoring capabilities through the FastMCP framework. Designed for infrastructure management, performance analysis, and workshop demonstrations.

Features

Six Core Monitoring Tools: Device discovery, status monitoring, performance analysis, memory analysis, power/thermal monitoring, and health assessment
Intelligent Health Analysis: AI-powered health scoring with contextual recommendations
N/A Value Handling: Robust handling of missing or unavailable metrics without failures
FastMCP Integration: Modern MCP implementation with proper tool registration and error handling
Demo Mode Support: Works on development systems without enterprise GPUs

Quick Start

Prerequisites

Python 3.11+
AMD GPU with ROCm/AMD SMI installed (or any system for demo mode)
Git

Installation

Option 1: Install as a Package (Recommended)

Clone the repository:

```
git clone <repository-url>
cd mcp-amdsmi
```

Install the package:
```
pip install -e .
```
Test the installation:
```
mcp-amdsmi --help
```

Option 2: Development Installation

Clone and create virtual environment:

```
git clone <repository-url>
cd mcp-amdsmi
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install in development mode:
```
pip install -e .
```
Test the installation:
```
python test_monitoring.py
```

Running the MCP Server

Add this simplified configuration to your MCP client:

```
{
  "mcpServers": {
    "mcp-amdsmi": {
      "command": "mcp-amdsmi"
    }
  }
}
```

To use a specific installation, such as one inside a virtual environment, give the full path to the executable:

```
{
  "mcpServers": {
    "mcp-amdsmi": {
      "command": "/path/to/venv/bin/mcp-amdsmi"
    }
  }
}
```
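The same entry can be added programmatically. The following is a minimal sketch, assuming the client keeps its servers in a JSON file under a top-level `mcpServers` key, as in the configuration shown in this section. The helper name `add_server` and the config file location are hypothetical, since each MCP client stores its configuration in its own place.

```python
import json
from pathlib import Path


def add_server(config_path: Path, name: str, command: str) -> dict:
    """Merge one server entry into an MCP client config file and return the result."""
    # Start from the existing config, if any, so other registered servers are preserved.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_server(Path("/path/to/client-config.json"), "mcp-amdsmi", "mcp-amdsmi")` would produce the first configuration shown in this section while leaving any other entries in the file untouched.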

Available Tools

1. `get_gpu_discovery`

Discovers and enumerates all available AMD GPU devices.

Returns device information, driver versions, and hardware specifications
Works in both real hardware and demo modes

2. `get_gpu_status`

Provides comprehensive current status of a specific GPU.

Temperature, power, utilization, memory, clock speeds, and fan data
Includes overall health score (0-100)
Parameters: device_id (string, default: "0")

3. `get_gpu_performance`

Analyzes GPU performance metrics and efficiency.

Performance analysis with efficiency scoring
Utilization patterns and bottleneck identification
Parameters: device_id (string, default: "0")

4. `analyze_gpu_memory`

Detailed GPU memory usage analysis.

Memory health assessment
Usage patterns and recommendations
Parameters: device_id (string, default: "0")

5. `monitor_power_thermal`

Monitors GPU power consumption and thermal status.

Real-time power and temperature data
Thermal warnings and power efficiency metrics
Parameters: device_id (string, default: "0")

6. `check_gpu_health`

Comprehensive GPU health assessment with recommendations.

Overall health status and scoring
Issue detection and actionable recommendations
Parameters: device_id (string, default: "0")
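An MCP client normally issues these calls for you, but it can help to see what a tool invocation looks like on the wire. The sketch below builds a JSON-RPC 2.0 `tools/call` request, the method MCP uses to invoke a tool. The helper `build_tool_call` is hypothetical, and transport details (stdio framing, session initialization) are left out.

```python
import json
from itertools import count

_request_ids = count(1)  # each JSON-RPC request in a session needs its own id


def build_tool_call(tool: str, device_id: str = "0") -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for one of the tools listed above."""
    # get_gpu_discovery is listed without parameters; the other tools take device_id.
    arguments = {} if tool == "get_gpu_discovery" else {"device_id": device_id}
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(request)
```

Note that `device_id` is passed as a string (`"0"`, not `0`), matching the parameter type documented for each tool.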

Example Usage

Once integrated with Claude Code, you can use natural language queries:

"What GPUs are available in the system?"
"Check the health of GPU 0"
"Show me the performance metrics for all GPUs"
"Is GPU 0 running too hot?"
"Analyze memory usage patterns"

Architecture

The system consists of three main layers:

AMD SMI Interface Layer (AMDSMIManager) - Abstracts AMD SMI Python API with robust error handling
Business Logic Layer (HealthAnalyzer, PerformanceInterpreter) - Provides intelligent analysis and recommendations
MCP Server Layer (FastMCP-based) - Exposes functionality as conversational tools
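The layering can be illustrated with a small sketch. Only the class names `AMDSMIManager` and `HealthAnalyzer` come from this README; the method names, the stubbed metric values, and the 85 °C threshold are assumptions made for illustration and are not the project's actual code.

```python
class AMDSMIManager:
    """Interface layer: returns raw metrics (stubbed here with fixed values)."""

    def get_metrics(self, device_id: str) -> dict:
        # A real implementation would query the AMD SMI Python API here.
        return {"device_id": device_id, "temperature_c": 65.0, "power_w": 180.0}


class HealthAnalyzer:
    """Business logic layer: turns raw metrics into a judgement."""

    def assess(self, metrics: dict) -> dict:
        # Hypothetical threshold, chosen only for illustration.
        status = "warning" if metrics["temperature_c"] >= 85.0 else "ok"
        return {"device_id": metrics["device_id"], "status": status}


def check_gpu_health(device_id: str = "0") -> dict:
    """Server layer: the plain function a tool registration would expose."""
    return HealthAnalyzer().assess(AMDSMIManager().get_metrics(device_id))
```

Keeping the server-layer function this thin means the interface layer can be swapped for mock data without touching the analysis code.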

Demo Mode

The server automatically falls back to demo mode when:

No AMD GPUs are detected
AMD SMI library is unavailable
Hardware access fails

Demo mode provides realistic mock data for development and testing.
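The three fallback conditions could be handled along the following lines. This is a sketch under assumptions, not the server's actual logic: the function names are hypothetical, the demo device list is invented, and the `amdsmi` calls used (`amdsmi_init`, `amdsmi_get_processor_handles`, `amdsmi_shut_down`) assume a recent AMD SMI Python package.

```python
from typing import Callable, List, Tuple


def probe_amdsmi() -> List[str]:
    """Return identifiers for detected GPUs; raises if the library or hardware is unusable."""
    import amdsmi  # raises ImportError when the library is not installed

    amdsmi.amdsmi_init()
    try:
        handles = amdsmi.amdsmi_get_processor_handles()
        return [str(index) for index, _ in enumerate(handles)]
    finally:
        amdsmi.amdsmi_shut_down()


def select_mode(probe: Callable[[], List[str]] = probe_amdsmi) -> Tuple[str, List[str]]:
    """Pick "hardware" or "demo" mode based on the three fallback conditions."""
    try:
        devices = probe()
    except ImportError:
        return "demo", ["0"]  # AMD SMI library is unavailable
    except Exception:
        return "demo", ["0"]  # hardware access failed
    if not devices:
        return "demo", ["0"]  # no AMD GPUs detected
    return "hardware", devices
```

Passing the probe in as an argument makes each fallback path testable on a machine without a GPU.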

N/A Value Handling

The server gracefully handles missing or "N/A" values common in:

Development environments
Limited hardware access scenarios
Partial metric availability

Missing values receive neutral health scores (80.0) and don't cause failures.
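A minimal sketch of this behaviour follows. The neutral score of 80.0 is the value stated in this README; the function names and the linear temperature scoring rule are hypothetical and serve only to show where the neutral value is substituted.

```python
import math
from typing import Optional, Union

NEUTRAL_SCORE = 80.0  # value this README states for missing metrics


def parse_metric(raw: Union[str, int, float, None]) -> Optional[float]:
    """Convert a raw reading to float, or None when it is missing or "N/A"."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # covers None, "N/A", and other non-numeric input
    return None if math.isnan(value) else value


def temperature_score(raw: Union[str, int, float, None], limit_c: float = 100.0) -> float:
    """Hypothetical rule: 100 at 0 degrees C, falling linearly to 0 at limit_c."""
    value = parse_metric(raw)
    if value is None:
        return NEUTRAL_SCORE  # missing data neither helps nor hurts the overall score
    return max(0.0, min(100.0, 100.0 * (1.0 - value / limit_c)))
```

With this shape, a reading of `"N/A"` yields 80.0 instead of raising, so one unavailable sensor does not abort the whole health assessment.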

Development

Project Structure

```
mcp-amdsmi/
├── src/amd_smi_mcp/
│   ├── server.py              # FastMCP server with tool definitions
│   ├── amd_smi_wrapper.py     # AMD SMI library abstraction
│   └── business_logic.py      # Health analysis and performance interpretation
├── tests/                     # Unit and integration tests
├── test_monitoring.py         # Comprehensive test script
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```

Running Tests

```
source venv/bin/activate
python test_monitoring.py      # Comprehensive functionality test
pytest                         # Unit tests (if available)
```

Code Quality

```
source venv/bin/activate
black src/                     # Code formatting
flake8 src/                    # Linting
mypy src/                      # Type checking
```

Workshop Integration

Designed for PEARC25 workshop demonstrations:

30-second response times for single GPU queries
Support for 30 concurrent users
Educational explanations and visual indicators
Fallback modes for demonstration reliability

Troubleshooting

Common Issues

1. "amdsmi library not available"

Install ROCm and AMD SMI library
Server will automatically use demo mode if unavailable

2. "No AMD GPU devices found"

Check GPU hardware installation
Verify driver installation
Server continues in demo mode

3. "Permission denied" errors

Ensure user has GPU access permissions
May require adding user to appropriate groups

4. Import errors in Claude Code

Verify cwd and PYTHONPATH in MCP configuration
Ensure virtual environment activation if using venv
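For the permission errors in item 3, a quick check of the GPU device nodes can narrow down the cause. The sketch below assumes a Linux system where the ROCm stack exposes `/dev/kfd` and `/dev/dri/renderD*`; the helper name `check_gpu_access` is hypothetical.

```python
import glob
import os
from typing import Dict, Iterable, Optional


def check_gpu_access(paths: Optional[Iterable[str]] = None) -> Dict[str, str]:
    """Report "ok", "denied", or "missing" for each GPU device node."""
    if paths is None:
        # Device nodes a ROCm installation normally uses on Linux.
        paths = ["/dev/kfd", *sorted(glob.glob("/dev/dri/renderD*"))]
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif os.access(path, os.R_OK | os.W_OK):
            report[path] = "ok"
        else:
            report[path] = "denied"
    return report
```

A `denied` result usually means the current user is not in the group that owns the device node (commonly `render` or `video` on ROCm systems), while `missing` points to a driver problem and not a permissions one.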

Logging

Enable debug logging by setting the following environment variables:

```
export PYTHONPATH=/path/to/mcp-amdsmi
export LOG_LEVEL=DEBUG
```
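How the server consumes `LOG_LEVEL` internally is not documented here. One common pattern, shown as a sketch with a hypothetical `configure_logging` helper, is to map the variable onto the standard `logging` levels at startup:

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Configure root logging from LOG_LEVEL and return the numeric level applied."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unrecognized names fall back to INFO
    # force=True replaces any handlers configured earlier in the process.
    logging.basicConfig(level=level, force=True)
    return level
```

With `LOG_LEVEL=DEBUG` exported, `configure_logging()` returns `logging.DEBUG` and debug messages from all loggers reach the root handler.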

License

[License information to be added]

Contributing

[Contributing guidelines to be added]
