MCP GPU Cluster Server
A Model Context Protocol (MCP) server for managing remote GPU clusters via SSH. This enables Claude Code to interact with remote GPU machines for training, job management, and LLM queries.
Features
- GPU Monitoring: Check GPU status, utilization, memory, and processes
- Job Management: Submit, track, and cancel training jobs (single and distributed)
- Code Sync: Clone repos, sync code changes, and manage project files
- LLM Integration: Query local LLMs (vLLM/Ollama) running on cluster
- File Transfer: Upload/download files and model checkpoints
- SSH Execution: Run arbitrary commands on cluster machines
Installation
Via uvx (Recommended)
uvx mcp-gpu-cluster
Via pip
pip install mcp-gpu-cluster
From source
git clone <repo-url>
cd standalone/mcp-gpu-cluster
pip install -e .
Configuration
The server is configured entirely via environment variables. Only the cluster hosts are required; everything else has a default:
Required Environment Variables
- GPU_CLUSTER_HOSTS: Comma-separated list of GPU cluster hostnames (e.g., "node-1,node-2")
Optional Environment Variables
- GPU_CLUSTER_PROJECT_PATH: Path to the project directory on the cluster (default: ~/Projects/openscope-rl)
- GPU_CLUSTER_GITHUB_REPO: GitHub repository URL (default: https://github.com/jmzlx/openscope-rl)
- GPU_CLUSTER_LLM_HOST: LLM server hostname (default: first host in GPU_CLUSTER_HOSTS)
- GPU_CLUSTER_LLM_PORT: LLM server port (default: 8000)
- GPU_CLUSTER_LLM_BACKEND: LLM backend, vllm or ollama (default: vllm)
- SSH_TIMEOUT: SSH command timeout in seconds (default: 300)
- SSH_CONNECT_TIMEOUT: SSH connection timeout in seconds (default: 10)
- GPU_CLUSTER_MASTER_HOST: Distributed training master host (default: first host)
- GPU_CLUSTER_MASTER_PORT: Distributed training master port (default: 29500)
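As an illustration of how these variables and defaults fit together, the resolution can be pictured roughly as follows. This is a minimal sketch; the actual code inside mcp-gpu-cluster may be structured differently.
# Illustrative sketch of resolving the documented environment variables
# into a configuration dict (not the server's actual implementation).
import os

def load_config() -> dict:
    hosts_raw = os.environ.get("GPU_CLUSTER_HOSTS")
    if not hosts_raw:
        raise RuntimeError("GPU_CLUSTER_HOSTS environment variable is required")
    hosts = [h.strip() for h in hosts_raw.split(",") if h.strip()]
    return {
        "hosts": hosts,
        "project_path": os.environ.get("GPU_CLUSTER_PROJECT_PATH", "~/Projects/openscope-rl"),
        "github_repo": os.environ.get("GPU_CLUSTER_GITHUB_REPO", "https://github.com/jmzlx/openscope-rl"),
        "llm_host": os.environ.get("GPU_CLUSTER_LLM_HOST", hosts[0]),
        "llm_port": int(os.environ.get("GPU_CLUSTER_LLM_PORT", "8000")),
        "llm_backend": os.environ.get("GPU_CLUSTER_LLM_BACKEND", "vllm"),
        "ssh_timeout": int(os.environ.get("SSH_TIMEOUT", "300")),
        "ssh_connect_timeout": int(os.environ.get("SSH_CONNECT_TIMEOUT", "10")),
        "master_host": os.environ.get("GPU_CLUSTER_MASTER_HOST", hosts[0]),
        "master_port": int(os.environ.get("GPU_CLUSTER_MASTER_PORT", "29500")),
    }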
Claude Code MCP Configuration
Add this to your Claude Code MCP settings:
{
"mcpServers": {
"gpu-cluster": {
"command": "uvx",
"args": ["mcp-gpu-cluster"],
"env": {
"GPU_CLUSTER_HOSTS": "node-1,node-2",
"GPU_CLUSTER_PROJECT_PATH": "~/myproject",
"GPU_CLUSTER_GITHUB_REPO": "https://github.com/username/myproject"
}
}
}
}
Or with all options:
{
"mcpServers": {
"gpu-cluster": {
"command": "uvx",
"args": ["mcp-gpu-cluster"],
"env": {
"GPU_CLUSTER_HOSTS": "spark-1,spark-2",
"GPU_CLUSTER_PROJECT_PATH": "~/Projects/openscope-rl",
"GPU_CLUSTER_GITHUB_REPO": "https://github.com/jmzlx/openscope-rl",
"GPU_CLUSTER_LLM_HOST": "spark-1",
"GPU_CLUSTER_LLM_PORT": "8000",
"GPU_CLUSTER_LLM_BACKEND": "vllm",
"SSH_TIMEOUT": "300",
"SSH_CONNECT_TIMEOUT": "10"
}
}
}
}
Prerequisites
SSH Access
- SSH key-based authentication must be configured for all cluster hosts
- Hosts must be accessible via hostname (add entries to ~/.ssh/config if needed)
Example ~/.ssh/config:
Host spark-1
HostName 192.168.1.10
User username
IdentityFile ~/.ssh/id_rsa
Host spark-2
HostName 192.168.1.11
User username
IdentityFile ~/.ssh/id_rsa
Cluster Requirements
- GPU machines with NVIDIA GPUs and nvidia-smi installed
- Python environment with PyTorch for training
- (Optional) vLLM or Ollama for LLM serving
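Before wiring the server into Claude Code, it can help to confirm these prerequisites from your workstation. The snippet below is a hypothetical preflight check; the host names are placeholders and should match your GPU_CLUSTER_HOSTS.
# Hypothetical preflight check: confirm key-based SSH access and nvidia-smi
# on every cluster host. Host names are placeholders.
import subprocess

HOSTS = ["node-1", "node-2"]  # should match GPU_CLUSTER_HOSTS

for host in HOSTS:
    for label, remote_cmd in [("ssh", "true"), ("nvidia-smi", "nvidia-smi -L")]:
        # BatchMode=yes makes ssh fail instead of prompting for a password,
        # which verifies that key-based auth is actually configured.
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, remote_cmd],
            capture_output=True, text=True,
        )
        status = "OK" if result.returncode == 0 else f"FAILED (exit {result.returncode})"
        print(f"{host}: {label} check {status}")
        if result.returncode == 0 and label == "nvidia-smi":
            print(result.stdout.strip())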
Available Tools
GPU Monitoring
- gpu_status: Get GPU utilization, memory, and temperature
- gpu_processes: List processes using GPUs
Job Management
- submit_job: Submit a training job to a single host
- submit_distributed_job: Submit distributed training across multiple hosts
- list_jobs: List all tracked jobs
- job_status: Get detailed job status
- cancel_job: Cancel a running job
- get_job_logs: Retrieve job logs
Code Management
- clone_repo: Clone the GitHub repo to the cluster
- sync_code: Rsync local changes to the cluster
- pull_latest: Git pull the latest changes on the cluster
SSH Execution
- ssh_run: Execute an arbitrary command on the cluster
- ssh_run_background: Run a command in the background
LLM Integration
- llm_query: Query the local LLM with a prompt
- llm_chat: Multi-turn chat with the LLM
- llm_status: Check LLM server status
- llm_list_models: List available models
- llm_start_server: Start the LLM server (vLLM/Ollama)
- llm_stop_server: Stop the LLM server
File Transfer
- download_file: Download a file from the cluster
- upload_file: Upload a file to the cluster
- list_checkpoints: List model checkpoints
Usage Examples
GPU Monitoring
# In Claude Code:
# "Check GPU status on all machines"
# Uses: gpu_status with host="all"
Submit Training Job
# "Submit PPO training on spark-1 using 2 GPUs"
# Uses: submit_job with command="python training/ppo_trainer.py", gpus=2
Distributed Training
# "Run distributed training across all nodes"
# Uses: submit_distributed_job with script="training/dt_trainer.py", gpus_per_node=1
Query Local LLM
# "Ask the local LLM to explain this error message"
# Uses: llm_query with prompt="..."
Sync Code
# "Sync my local code changes to the cluster"
# Uses: sync_code with host="all"
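Each of these natural-language requests maps to an MCP tool call that Claude Code issues for you. For reference, here is a minimal sketch of the same gpu_status call made from a standalone Python client, assuming the official MCP Python SDK (the mcp package); the tool argument host="all" follows the usage shown above.
# Sketch of calling the gpu_status tool directly via the MCP Python SDK.
# Claude Code normally performs this tool call on your behalf.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server the same way the Claude Code config does.
    server = StdioServerParameters(
        command="uvx",
        args=["mcp-gpu-cluster"],
        env={"GPU_CLUSTER_HOSTS": "node-1,node-2"},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("gpu_status", {"host": "all"})
            print(result.content)

asyncio.run(main())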
Development
# Clone the repository
git clone <repo-url>
cd standalone/mcp-gpu-cluster
# Install in development mode
pip install -e .
# Run the server
python -m mcp_gpu_cluster
Troubleshooting
SSH Connection Issues
- Verify SSH key-based auth works: ssh node-1
- Check that ~/.ssh/config has correct host entries
- Ensure StrictHostKeyChecking is disabled or the host keys are in known_hosts
GPU Commands Fail
- Verify nvidia-smi is installed on the cluster: ssh node-1 nvidia-smi
- Check that CUDA drivers are installed
LLM Server Not Responding
- Verify the LLM server is running: llm_status
- Check that the host and port are correct
- Ensure firewall allows connections to port 8000
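If llm_status reports a problem, you can also probe the server directly from your workstation. A hedged example for the default vLLM backend, which serves an OpenAI-compatible HTTP API; the host and port below are placeholders.
# Probe a vLLM server's OpenAI-compatible API directly (host/port placeholders).
# For an Ollama backend, the equivalent check is GET http://<host>:11434/api/tags.
import requests

resp = requests.get("http://spark-1:8000/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])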
Environment Variable Not Set
Error: GPU_CLUSTER_HOSTS environment variable is required
Solution: Set the environment variable in your MCP config:
"env": {
"GPU_CLUSTER_HOSTS": "node-1,node-2"
}
License
MIT License - see LICENSE file for details.
Contributing
Contributions welcome! Please open an issue or pull request.