MCP GPU Cluster Server

A Model Context Protocol (MCP) server for managing remote GPU clusters via SSH. This enables Claude Code to interact with remote GPU machines for training, job management, and LLM queries.

Features

  • GPU Monitoring: Check GPU status, utilization, memory, and processes
  • Job Management: Submit, track, and cancel training jobs (single and distributed)
  • Code Sync: Clone repos, sync code changes, and manage project files
  • LLM Integration: Query local LLMs (vLLM/Ollama) running on the cluster
  • File Transfer: Upload/download files and model checkpoints
  • SSH Execution: Run arbitrary commands on cluster machines

Installation

Via uvx (Recommended)

uvx mcp-gpu-cluster

Via pip

pip install mcp-gpu-cluster

From source

git clone <repo-url>
cd standalone/mcp-gpu-cluster
pip install -e .

Configuration

The server is configured entirely via environment variables. Only the host list is required; all other settings have defaults:

Required Environment Variables

  • GPU_CLUSTER_HOSTS: Comma-separated list of GPU cluster hostnames (e.g., "node-1,node-2")

Optional Environment Variables

  • GPU_CLUSTER_PROJECT_PATH: Path to project directory on cluster (default: ~/Projects/openscope-rl)
  • GPU_CLUSTER_GITHUB_REPO: GitHub repository URL (default: https://github.com/jmzlx/openscope-rl)
  • GPU_CLUSTER_LLM_HOST: LLM server hostname (default: first host in GPU_CLUSTER_HOSTS)
  • GPU_CLUSTER_LLM_PORT: LLM server port (default: 8000)
  • GPU_CLUSTER_LLM_BACKEND: LLM backend - vllm or ollama (default: vllm)
  • SSH_TIMEOUT: SSH command timeout in seconds (default: 300)
  • SSH_CONNECT_TIMEOUT: SSH connection timeout in seconds (default: 10)
  • GPU_CLUSTER_MASTER_HOST: Distributed training master host (default: first host)
  • GPU_CLUSTER_MASTER_PORT: Distributed training master port (default: 29500)
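
For a quick sanity check outside Claude Code, you can export the variables in a shell and launch the server directly (a minimal sketch; hostnames and paths are placeholders):

# Minimal standalone run to verify configuration (hostnames are placeholders)
export GPU_CLUSTER_HOSTS="node-1,node-2"
export GPU_CLUSTER_PROJECT_PATH="~/myproject"
export GPU_CLUSTER_LLM_BACKEND="vllm"

# The server speaks MCP over stdio; Claude Code normally launches it for you
uvx mcp-gpu-cluster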

Claude Code MCP Configuration

Add this to your Claude Code MCP settings:

{
  "mcpServers": {
    "gpu-cluster": {
      "command": "uvx",
      "args": ["mcp-gpu-cluster"],
      "env": {
        "GPU_CLUSTER_HOSTS": "node-1,node-2",
        "GPU_CLUSTER_PROJECT_PATH": "~/myproject",
        "GPU_CLUSTER_GITHUB_REPO": "https://github.com/username/myproject"
      }
    }
  }
}

Or with all options:

{
  "mcpServers": {
    "gpu-cluster": {
      "command": "uvx",
      "args": ["mcp-gpu-cluster"],
      "env": {
        "GPU_CLUSTER_HOSTS": "spark-1,spark-2",
        "GPU_CLUSTER_PROJECT_PATH": "~/Projects/openscope-rl",
        "GPU_CLUSTER_GITHUB_REPO": "https://github.com/jmzlx/openscope-rl",
        "GPU_CLUSTER_LLM_HOST": "spark-1",
        "GPU_CLUSTER_LLM_PORT": "8000",
        "GPU_CLUSTER_LLM_BACKEND": "vllm",
        "SSH_TIMEOUT": "300",
        "SSH_CONNECT_TIMEOUT": "10"
      }
    }
  }
}

Prerequisites

SSH Access

  • SSH key-based authentication must be configured for all cluster hosts
  • Hosts must be reachable by hostname (add entries to ~/.ssh/config if needed)

Example ~/.ssh/config:

Host spark-1
    HostName 192.168.1.10
    User username
    IdentityFile ~/.ssh/id_rsa

Host spark-2
    HostName 192.168.1.11
    User username
    IdentityFile ~/.ssh/id_rsa
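
Before wiring up the server, confirm that non-interactive key-based login works for every host. BatchMode makes ssh fail immediately instead of falling back to a password prompt:

# Should print each hostname with no password prompt
ssh -o BatchMode=yes spark-1 hostname
ssh -o BatchMode=yes spark-2 hostname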

Cluster Requirements

  • GPU machines with NVIDIA GPUs and nvidia-smi installed
  • Python environment with PyTorch for training
  • (Optional) vLLM or Ollama for LLM serving

Available Tools

GPU Monitoring

  • gpu_status: Get GPU utilization, memory, temperature
  • gpu_processes: List processes using GPUs
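
Both tools report data that nvidia-smi exposes. To cross-check a tool's output by hand, an equivalent query looks roughly like this (the exact fields the server requests may differ):

# Roughly what gpu_status reports
ssh node-1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv

# Roughly what gpu_processes reports
ssh node-1 nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv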

Job Management

  • submit_job: Submit a training job to a single host (see the sketch after this list)
  • submit_distributed_job: Submit distributed training across multiple hosts
  • list_jobs: List all tracked jobs
  • job_status: Get detailed job status
  • cancel_job: Cancel a running job
  • get_job_logs: Retrieve job logs
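
As a mental model only, a hand-rolled single-host submission with 2 GPUs might look like the line below; this is an illustrative sketch, not necessarily the command submit_job actually issues:

# Hypothetical manual equivalent of submit_job (illustrative only;
# the server's actual mechanism may differ)
ssh node-1 'cd ~/Projects/openscope-rl && CUDA_VISIBLE_DEVICES=0,1 nohup python training/ppo_trainer.py > train.log 2>&1 &'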

Code Management

  • clone_repo: Clone GitHub repo to cluster
  • sync_code: Rsync local changes to cluster
  • pull_latest: Git pull latest changes on cluster
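
sync_code is described above as rsync-based; done by hand for a single host, the equivalent is roughly the following (flags and excludes are illustrative):

# Manual equivalent of sync_code for one host (flags are illustrative)
rsync -avz --exclude '.git' ./ node-1:~/Projects/openscope-rl/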

SSH Execution

  • ssh_run: Execute arbitrary command on cluster
  • ssh_run_background: Run command in background

LLM Integration

  • llm_query: Query the local LLM with a prompt (see the curl examples after this list)
  • llm_chat: Multi-turn chat with LLM
  • llm_status: Check LLM server status
  • llm_list_models: List available models
  • llm_start_server: Start LLM server (vLLM/Ollama)
  • llm_stop_server: Stop LLM server
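
Both backends expose HTTP APIs you can hit directly to cross-check these tools: vLLM serves an OpenAI-compatible API, while Ollama has its own REST API on port 11434 by default (model names below are placeholders):

# vLLM (OpenAI-compatible, default port 8000 per the config above)
curl http://spark-1:8000/v1/models
curl http://spark-1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 32}'

# Ollama (native API, default port 11434; model name is a placeholder)
curl http://spark-1:11434/api/generate -d '{"model": "llama3", "prompt": "Hello"}'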

File Transfer

  • download_file: Download file from cluster
  • upload_file: Upload file to cluster
  • list_checkpoints: List model checkpoints
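
Since cluster access is plain SSH, the manual equivalents of these tools are scp commands like the following (paths are illustrative):

# Manual equivalents of download_file / upload_file (paths are illustrative)
scp node-1:~/Projects/openscope-rl/checkpoints/best.pt ./checkpoints/
scp ./config.yaml node-1:~/Projects/openscope-rl/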

Usage Examples

GPU Monitoring

# In Claude Code:
# "Check GPU status on all machines"
# Uses: gpu_status with host="all"

Submit Training Job

# "Submit PPO training on spark-1 using 2 GPUs"
# Uses: submit_job with command="python training/ppo_trainer.py", gpus=2

Distributed Training

# "Run distributed training across all nodes"
# Uses: submit_distributed_job with script="training/dt_trainer.py", gpus_per_node=1
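
Multi-node PyTorch launches rendezvous through a master address and port, which is what GPU_CLUSTER_MASTER_HOST and GPU_CLUSTER_MASTER_PORT configure. A hand-rolled torchrun equivalent, not necessarily the exact command the server issues, would run on each node as:

# Illustrative torchrun launch for node 0 of 2 (node_rank is 1 on the second
# node); the server's actual launch command may differ
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --master_addr=node-1 --master_port=29500 \
  training/dt_trainer.py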

Query Local LLM

# "Ask the local LLM to explain this error message"
# Uses: llm_query with prompt="..."

Sync Code

# "Sync my local code changes to the cluster"
# Uses: sync_code with host="all"

Development

# Clone the repository
git clone <repo-url>
cd standalone/mcp-gpu-cluster

# Install in development mode
pip install -e .

# Run the server
python -m mcp_gpu_cluster

Troubleshooting

SSH Connection Issues

  • Verify SSH key-based auth works: ssh node-1
  • Check ~/.ssh/config has correct host entries
  • Ensure the cluster hosts' keys are in known_hosts, or relax StrictHostKeyChecking for those hosts only (disabling it globally is insecure)

GPU Commands Fail

  • Verify nvidia-smi is installed on cluster: ssh node-1 nvidia-smi
  • Check CUDA drivers are installed

LLM Server Not Responding

  • Verify LLM server is running: llm_status
  • Check host and port are correct
  • Ensure firewall allows connections to port 8000
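
A direct HTTP check from the machine running the MCP server separates LLM-server problems from network problems (assumes the vLLM defaults from the configuration above):

# Health and model checks against the default vLLM endpoint
curl -sf http://spark-1:8000/health && echo "server is up"
curl -s http://spark-1:8000/v1/models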

Environment Variable Not Set

Error: GPU_CLUSTER_HOSTS environment variable is required

Solution: Set the environment variable in your MCP config:

"env": {
  "GPU_CLUSTER_HOSTS": "node-1,node-2"
}

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please open an issue or pull request.