Dataproc MCP Server
A Model Context Protocol (MCP) server that provides tools for managing Google Cloud Dataproc clusters and jobs. This server enables AI assistants to interact with Dataproc resources through a standardized interface.
Features
Cluster Management
- List Clusters: View all clusters in a project and region
- Create Cluster: Provision new Dataproc clusters with custom configurations
- Delete Cluster: Remove existing clusters
- Get Cluster: Retrieve detailed information about specific clusters
Job Management
- Submit Jobs: Run Spark, PySpark, Spark SQL, Hive, Pig, and Hadoop jobs
- List Jobs: View jobs across clusters with filtering options
- Get Job: Retrieve detailed job information and status
- Cancel Job: Stop running jobs
Batch Operations
- Create Batch Jobs: Submit serverless Dataproc batch jobs
- List Batch Jobs: View all batch jobs in a region
- Get Batch Job: Retrieve detailed batch job information
- Delete Batch Job: Remove batch jobs
Installation
Prerequisites
- Python 3.11 or higher (Python 3.13+ recommended)
- Google Cloud SDK configured with appropriate permissions
- Dataproc API enabled in your Google Cloud project
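If the Dataproc API is not yet enabled, it can be switched on with gcloud (replace my-project with your project ID):
gcloud services enable dataproc.googleapis.com --project=my-project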
Install from Source
# Clone the repository
git clone https://github.com/warrenzhu25/dataproc-mcp.git
cd dataproc-mcp
# Create virtual environment (recommended for Homebrew Python)
python3 -m venv .venv
source .venv/bin/activate
# Install project dependencies
pip install -e .
# Install development dependencies (optional)
pip install -e ".[dev]"
Alternative Installation Methods
# With uv (if available)
uv pip install --system -e .
# With uv development dependencies
uv pip install --system -e ".[dev]"
Troubleshooting Installation
If you encounter issues:
- Python version errors: Ensure you have Python 3.11+ installed
  python --version  # Should be 3.11 or higher
- Externally managed environment errors: Use a virtual environment
  python3 -m venv .venv
  source .venv/bin/activate
- Missing module errors: Make sure dependencies are installed
  pip install -e .
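A quick sanity check that the package installed correctly is to import it from the active environment:
python -c "import dataproc_mcp_server"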
Configuration
Authentication
The server supports multiple authentication methods:
- Service Account Key (recommended for production):
  export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
- Application Default Credentials:
  gcloud auth application-default login
- Compute Engine Service Account (when running on GCE)
Required Permissions
Ensure your service account or user has the following IAM roles:
- roles/dataproc.editor - For cluster and job management
- roles/storage.objectViewer - For accessing job files in Cloud Storage
- roles/compute.networkUser - For VPC network access (if using custom networks)
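These roles can be granted with gcloud; the service account name below is illustrative:
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:dataproc-mcp@my-project.iam.gserviceaccount.com" \
    --role="roles/dataproc.editor"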
Usage
Running the Server
First, activate your virtual environment (if using one):
source .venv/bin/activate
The server supports multiple transport protocols:
# STDIO (default) - for command-line tools and MCP clients
python -m dataproc_mcp_server
# HTTP - REST API over HTTP using streamable-http transport
DATAPROC_MCP_TRANSPORT=http python -m dataproc_mcp_server
# SSE - Server-Sent Events for real-time communication
DATAPROC_MCP_TRANSPORT=sse python -m dataproc_mcp_server
# Run with entry point script (STDIO only)
dataproc-mcp-server
Transport Configuration
- STDIO (default): Standard input/output communication for command-line tools and MCP clients
- HTTP: REST API over HTTP using streamable-http transport
  - Server URL: http://localhost:8000/mcp
  - Accessible via web clients and HTTP-based MCP clients
- SSE: Server-Sent Events for real-time bidirectional communication
  - Server URL: http://localhost:8000/sse
  - Supports streaming responses and live updates
Environment Variables
# Transport type (stdio, http, sse)
export DATAPROC_MCP_TRANSPORT=http
# Server host (for HTTP/SSE transports)
export DATAPROC_MCP_HOST=0.0.0.0
# Server port (for HTTP/SSE transports)
export DATAPROC_MCP_PORT=8080
# Enable debug logging (true, 1, yes to enable)
export DATAPROC_MCP_DEBUG=true
# Authentication
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
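Putting these together, a one-off HTTP server on port 8080 can be started as:
DATAPROC_MCP_TRANSPORT=http DATAPROC_MCP_HOST=0.0.0.0 DATAPROC_MCP_PORT=8080 python -m dataproc_mcp_server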
MCP Client Configuration
Add to your MCP client configuration:
{
"mcpServers": {
"dataproc": {
"command": "python",
"args": ["-m", "dataproc_mcp_server"],
"env": {
"GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json",
"DATAPROC_MCP_DEBUG": "true"
}
}
}
}
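If the server was installed into a virtual environment, the client may need the absolute path to that environment's interpreter instead of a bare python; the path below is illustrative:
{
  "mcpServers": {
    "dataproc": {
      "command": "/path/to/dataproc-mcp/.venv/bin/python",
      "args": ["-m", "dataproc_mcp_server"],
      "env": {
        "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json"
      }
    }
  }
}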
Testing with MCP Inspector
You can test the server using the official MCP Inspector:
# Test STDIO transport
npx @modelcontextprotocol/inspector python -m dataproc_mcp_server
# Test HTTP transport with debug logging
DATAPROC_MCP_TRANSPORT=http DATAPROC_MCP_DEBUG=true python -m dataproc_mcp_server &
npx @modelcontextprotocol/inspector --transport http --server-url http://127.0.0.1:8000/mcp
# Test SSE transport
DATAPROC_MCP_TRANSPORT=sse python -m dataproc_mcp_server &
npx @modelcontextprotocol/inspector --transport sse --server-url http://127.0.0.1:8000/sse
The MCP Inspector provides a web interface to:
- Browse available tools and resources
- Test tool calls with custom parameters
- View real-time protocol messages
- Debug server responses
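For a quick smoke test without the Inspector, the streamable-http endpoint can also be probed directly; the JSON-RPC initialize payload below is a minimal sketch of the MCP handshake:
curl -s -X POST http://localhost:8000/mcp \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"curl","version":"0.0"}}}'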
Example Tool Usage
Create a Cluster
{
"name": "create_cluster",
"arguments": {
"project_id": "my-project",
"region": "us-central1",
"cluster_name": "my-cluster",
"num_instances": 3,
"machine_type": "n1-standard-4",
"disk_size_gb": 100,
"image_version": "2.1-debian11"
}
}
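The new cluster should then show up in list_clusters for the same project and region:
{
  "name": "list_clusters",
  "arguments": {
    "project_id": "my-project",
    "region": "us-central1"
  }
}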
Submit a PySpark Job
{
"name": "submit_job",
"arguments": {
"project_id": "my-project",
"region": "us-central1",
"cluster_name": "my-cluster",
"job_type": "pyspark",
"main_file": "gs://my-bucket/my-script.py",
"args": ["--input", "gs://my-bucket/input", "--output", "gs://my-bucket/output"],
"properties": {
"spark.executor.memory": "4g",
"spark.executor.instances": "3"
}
}
}
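Job status can be polled with get_job, substituting the job ID returned by submit_job (the value below is a placeholder):
{
  "name": "get_job",
  "arguments": {
    "project_id": "my-project",
    "region": "us-central1",
    "job_id": "<job-id-from-submit>"
  }
}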
Create a Batch Job
{
"name": "create_batch_job",
"arguments": {
"project_id": "my-project",
"region": "us-central1",
"batch_id": "my-batch-job",
"job_type": "pyspark",
"main_file": "gs://my-bucket/batch-script.py",
"service_account": "my-service-account@my-project.iam.gserviceaccount.com"
}
}
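The batch can then be inspected with get_batch_job using the same identifiers:
{
  "name": "get_batch_job",
  "arguments": {
    "project_id": "my-project",
    "region": "us-central1",
    "batch_id": "my-batch-job"
  }
}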
Development
Setup Development Environment
# Install development dependencies
uv pip install --system -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
Running Tests
# Run all tests
pytest
# Run with coverage
python -m pytest --cov=src/dataproc_mcp_server tests/
# Run specific test file
pytest tests/test_dataproc_client.py -v
Code Quality
# Format code
ruff format src/ tests/
# Lint code
ruff check src/ tests/
# Type checking (with VS Code + Pylance or mypy)
mypy src/
Project Structure
dataproc-mcp/
├── src/dataproc_mcp_server/
│ ├── __init__.py
│ ├── __main__.py # Entry point
│ ├── server.py # MCP server implementation
│ ├── dataproc_client.py # Dataproc cluster/job operations
│ └── batch_client.py # Dataproc batch operations
├── tests/
│ ├── __init__.py
│ ├── test_server.py
│ └── test_dataproc_client.py
├── examples/
│ ├── mcp_server_config.json
│ └── example_usage.py
├── pyproject.toml
├── CLAUDE.md # Development guide
└── README.md
Troubleshooting
Common Issues
- Authentication Errors:
  - Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
  - Ensure service account has required permissions
  - Check that Dataproc API is enabled
- Network Errors:
  - Verify VPC/subnet configurations for custom networks
  - Check firewall rules for cluster communication
  - Ensure clusters are in the correct region
- Job Submission Failures:
  - Verify file paths in Cloud Storage are accessible
  - Check cluster has sufficient resources
  - Validate job configuration parameters
Debug Mode
Enable debug logging:
export PYTHONPATH=/path/to/dataproc-mcp/src
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from dataproc_mcp_server import __main__
import asyncio
asyncio.run(__main__.main())
"
API Reference
Tools
Cluster Management
- list_clusters(project_id, region) - List all clusters
- create_cluster(project_id, region, cluster_name, ...) - Create cluster
- delete_cluster(project_id, region, cluster_name) - Delete cluster
- get_cluster(project_id, region, cluster_name) - Get cluster details
Job Management
- submit_job(project_id, region, cluster_name, job_type, main_file, ...) - Submit job
- list_jobs(project_id, region, cluster_name?, job_states?) - List jobs
- get_job(project_id, region, job_id) - Get job details
- cancel_job(project_id, region, job_id) - Cancel job
Batch Operations
- create_batch_job(project_id, region, batch_id, job_type, main_file, ...) - Create batch job
- list_batch_jobs(project_id, region, page_size?) - List batch jobs
- get_batch_job(project_id, region, batch_id) - Get batch job details
- delete_batch_job(project_id, region, batch_id) - Delete batch job
Resources
- dataproc://clusters - Access cluster information
- dataproc://jobs - Access job information
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite and linting
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
For issues and questions:
- Check the troubleshooting section
- Review Google Cloud Dataproc documentation
- Open an issue in the repository