MCP GPU Cluster Server

A Model Context Protocol (MCP) server for managing remote GPU clusters via SSH. This enables Claude Code to interact with remote GPU machines for training, job management, and LLM queries.

Features

  • GPU Monitoring: Check GPU status, utilization, memory, and processes
  • Job Management: Submit, track, and cancel training jobs (single and distributed)
  • Code Sync: Clone repos, sync code changes, and manage project files
  • LLM Integration: Query local LLMs (vLLM/Ollama) running on the cluster
  • File Transfer: Upload/download files and model checkpoints
  • SSH Execution: Run arbitrary commands on cluster machines

Installation

Via uvx (Recommended)

uvx mcp-gpu-cluster

Via pip

pip install mcp-gpu-cluster

From source

git clone <repo-url>
cd standalone/mcp-gpu-cluster
pip install -e .

Configuration

The server is configured entirely via environment variables. Only the host list is required; all other settings have defaults:

Required Environment Variables

  • GPU_CLUSTER_HOSTS: Comma-separated list of GPU cluster hostnames (e.g., "node-1,node-2")

Optional Environment Variables

  • GPU_CLUSTER_PROJECT_PATH: Path to project directory on cluster (default: ~/Projects/openscope-rl)
  • GPU_CLUSTER_GITHUB_REPO: GitHub repository URL (default: https://github.com/jmzlx/openscope-rl)
  • GPU_CLUSTER_LLM_HOST: LLM server hostname (default: first host in GPU_CLUSTER_HOSTS)
  • GPU_CLUSTER_LLM_PORT: LLM server port (default: 8000)
  • GPU_CLUSTER_LLM_BACKEND: LLM backend - vllm or ollama (default: vllm)
  • SSH_TIMEOUT: SSH command timeout in seconds (default: 300)
  • SSH_CONNECT_TIMEOUT: SSH connection timeout in seconds (default: 10)
  • GPU_CLUSTER_MASTER_HOST: Distributed training master host (default: first host)
  • GPU_CLUSTER_MASTER_PORT: Distributed training master port (default: 29500)
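
For a quick sanity check outside Claude Code, you can export the variables in a shell and launch the server directly (a minimal sketch; hostnames and paths are placeholders):

# Minimal standalone run to verify configuration (hostnames are placeholders)
export GPU_CLUSTER_HOSTS="node-1,node-2"
export GPU_CLUSTER_PROJECT_PATH="~/myproject"
export GPU_CLUSTER_LLM_BACKEND="vllm"

# The server speaks MCP over stdio; Claude Code normally launches it for you
uvx mcp-gpu-cluster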

Claude Code MCP Configuration

Add this to your Claude Code MCP settings:

{
  "mcpServers": {
    "gpu-cluster": {
      "command": "uvx",
      "args": ["mcp-gpu-cluster"],
      "env": {
        "GPU_CLUSTER_HOSTS": "node-1,node-2",
        "GPU_CLUSTER_PROJECT_PATH": "~/myproject",
        "GPU_CLUSTER_GITHUB_REPO": "https://github.com/username/myproject"
      }
    }
  }
}

Or with all options:

{
  "mcpServers": {
    "gpu-cluster": {
      "command": "uvx",
      "args": ["mcp-gpu-cluster"],
      "env": {
        "GPU_CLUSTER_HOSTS": "spark-1,spark-2",
        "GPU_CLUSTER_PROJECT_PATH": "~/Projects/openscope-rl",
        "GPU_CLUSTER_GITHUB_REPO": "https://github.com/jmzlx/openscope-rl",
        "GPU_CLUSTER_LLM_HOST": "spark-1",
        "GPU_CLUSTER_LLM_PORT": "8000",
        "GPU_CLUSTER_LLM_BACKEND": "vllm",
        "SSH_TIMEOUT": "300",
        "SSH_CONNECT_TIMEOUT": "10"
      }
    }
  }
}

Prerequisites

SSH Access

  • SSH key-based authentication must be configured for all cluster hosts
  • Hosts must be reachable by hostname (add entries to ~/.ssh/config if needed)

Example ~/.ssh/config:

Host spark-1
    HostName 192.168.1.10
    User username
    IdentityFile ~/.ssh/id_rsa

Host spark-2
    HostName 192.168.1.11
    User username
    IdentityFile ~/.ssh/id_rsa
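
Before wiring up the server, confirm that non-interactive key-based login works for every host. BatchMode makes ssh fail immediately instead of falling back to a password prompt:

# Should print each hostname with no password prompt
ssh -o BatchMode=yes spark-1 hostname
ssh -o BatchMode=yes spark-2 hostname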

Cluster Requirements

  • GPU machines with NVIDIA GPUs and nvidia-smi installed
  • Python environment with PyTorch for training
  • (Optional) vLLM or Ollama for LLM serving

Available Tools

GPU Monitoring

  • gpu_status: Get GPU utilization, memory, temperature
  • gpu_processes: List processes using GPUs
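
Both tools report data that nvidia-smi exposes. To cross-check a tool's output by hand, an equivalent query looks roughly like this (the exact fields the server requests may differ):

# Roughly what gpu_status reports
ssh node-1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv

# Roughly what gpu_processes reports
ssh node-1 nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv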

Job Management

  • submit_job: Submit a training job to a single host (see the sketch after this list)
  • submit_distributed_job: Submit distributed training across multiple hosts
  • list_jobs: List all tracked jobs
  • job_status: Get detailed job status
  • cancel_job: Cancel a running job
  • get_job_logs: Retrieve job logs
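
As a mental model only, a hand-rolled single-host submission with 2 GPUs might look like the line below; this is an illustrative sketch, not necessarily the command submit_job actually issues:

# Hypothetical manual equivalent of submit_job (illustrative only;
# the server's actual mechanism may differ)
ssh node-1 'cd ~/Projects/openscope-rl && CUDA_VISIBLE_DEVICES=0,1 nohup python training/ppo_trainer.py > train.log 2>&1 &'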

Code Management

  • clone_repo: Clone GitHub repo to cluster
  • sync_code: Rsync local changes to cluster
  • pull_latest: Git pull latest changes on cluster
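
sync_code is described above as rsync-based; done by hand for a single host, the equivalent is roughly the following (flags and excludes are illustrative):

# Manual equivalent of sync_code for one host (flags are illustrative)
rsync -avz --exclude '.git' ./ node-1:~/Projects/openscope-rl/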

SSH Execution

  • ssh_run: Execute arbitrary command on cluster
  • ssh_run_background: Run command in background

LLM Integration

  • llm_query: Query the local LLM with a prompt (see the curl examples after this list)
  • llm_chat: Multi-turn chat with LLM
  • llm_status: Check LLM server status
  • llm_list_models: List available models
  • llm_start_server: Start LLM server (vLLM/Ollama)
  • llm_stop_server: Stop LLM server
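
Both backends expose HTTP APIs you can hit directly to cross-check these tools: vLLM serves an OpenAI-compatible API, while Ollama has its own REST API on port 11434 by default (model names below are placeholders):

# vLLM (OpenAI-compatible, default port 8000 per the config above)
curl http://spark-1:8000/v1/models
curl http://spark-1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 32}'

# Ollama (native API, default port 11434; model name is a placeholder)
curl http://spark-1:11434/api/generate -d '{"model": "llama3", "prompt": "Hello"}'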

File Transfer

  • download_file: Download file from cluster
  • upload_file: Upload file to cluster
  • list_checkpoints: List model checkpoints
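
Since cluster access is plain SSH, the manual equivalents of these tools are scp commands like the following (paths are illustrative):

# Manual equivalents of download_file / upload_file (paths are illustrative)
scp node-1:~/Projects/openscope-rl/checkpoints/best.pt ./checkpoints/
scp ./config.yaml node-1:~/Projects/openscope-rl/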

Usage Examples

GPU Monitoring

# In Claude Code:
# "Check GPU status on all machines"
# Uses: gpu_status with host="all"

Submit Training Job

# "Submit PPO training on spark-1 using 2 GPUs"
# Uses: submit_job with command="python training/ppo_trainer.py", gpus=2

Distributed Training

# "Run distributed training across all nodes"
# Uses: submit_distributed_job with script="training/dt_trainer.py", gpus_per_node=1
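
Multi-node PyTorch launches rendezvous through a master address and port, which is what GPU_CLUSTER_MASTER_HOST and GPU_CLUSTER_MASTER_PORT configure. A hand-rolled torchrun equivalent, not necessarily the exact command the server issues, would run on each node as:

# Illustrative torchrun launch for node 0 of 2 (node_rank is 1 on the second
# node); the server's actual launch command may differ
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --master_addr=node-1 --master_port=29500 \
  training/dt_trainer.py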

Query Local LLM

# "Ask the local LLM to explain this error message"
# Uses: llm_query with prompt="..."

Sync Code

# "Sync my local code changes to the cluster"
# Uses: sync_code with host="all"

Development

# Clone the repository
git clone <repo-url>
cd standalone/mcp-gpu-cluster

# Install in development mode
pip install -e .

# Run the server
python -m mcp_gpu_cluster

Troubleshooting

SSH Connection Issues

  • Verify SSH key-based auth works: ssh node-1
  • Check ~/.ssh/config has correct host entries
  • Ensure the cluster hosts' keys are in known_hosts, or relax StrictHostKeyChecking for those hosts only (disabling it globally is insecure)

GPU Commands Fail

  • Verify nvidia-smi is installed on cluster: ssh node-1 nvidia-smi
  • Check CUDA drivers are installed

LLM Server Not Responding

  • Verify LLM server is running: llm_status
  • Check host and port are correct
  • Ensure firewall allows connections to port 8000
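
A direct HTTP check from the machine running the MCP server separates LLM-server problems from network problems (assumes the vLLM defaults from the configuration above):

# Health and model checks against the default vLLM endpoint
curl -sf http://spark-1:8000/health && echo "server is up"
curl -s http://spark-1:8000/v1/models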

Environment Variable Not Set

Error: GPU_CLUSTER_HOSTS environment variable is required

Solution: Set the environment variable in your MCP config:

"env": {
  "GPU_CLUSTER_HOSTS": "node-1,node-2"
}

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please open an issue or pull request.