Dataproc MCP Server

A Model Context Protocol (MCP) server that provides tools for managing Google Cloud Dataproc clusters and jobs. This server enables AI assistants to interact with Dataproc resources through a standardized interface.

Features

Cluster Management

  • List Clusters: View all clusters in a project and region
  • Create Cluster: Provision new Dataproc clusters with custom configurations
  • Delete Cluster: Remove existing clusters
  • Get Cluster: Retrieve detailed information about specific clusters

Job Management

  • Submit Jobs: Run Spark, PySpark, Spark SQL, Hive, Pig, and Hadoop jobs
  • List Jobs: View jobs across clusters with filtering options
  • Get Job: Retrieve detailed job information and status
  • Cancel Job: Stop running jobs

Batch Operations

  • Create Batch Jobs: Submit serverless Dataproc batch jobs
  • List Batch Jobs: View all batch jobs in a region
  • Get Batch Job: Retrieve detailed batch job information
  • Delete Batch Job: Remove batch jobs

Installation

Prerequisites

  • Python 3.11 or higher (Python 3.13+ recommended)
  • Google Cloud SDK configured with appropriate permissions
  • Dataproc API enabled in your Google Cloud project

Install from Source

# Clone the repository
git clone https://github.com/warrenzhu25/dataproc-mcp.git
cd dataproc-mcp

# Create virtual environment (recommended for Homebrew Python)
python3 -m venv .venv
source .venv/bin/activate

# Install project dependencies
pip install -e .

# Install development dependencies (optional)
pip install -e ".[dev]"

Alternative Installation Methods

# With uv (if available)
uv pip install --system -e .

# With uv development dependencies
uv pip install --system -e ".[dev]"

Troubleshooting Installation

If you encounter issues:

  1. Python version errors: Ensure you have Python 3.11+ installed

    python --version  # Should be 3.11 or higher
    
  2. Externally managed environment errors: Use a virtual environment

    python3 -m venv .venv
    source .venv/bin/activate
    
  3. Missing module errors: Make sure dependencies are installed

    pip install -e .
    

Configuration

Authentication

The server supports multiple authentication methods (a quick credential check is sketched after the list):

  1. Service Account Key (Recommended for production):

    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
    
  2. Application Default Credentials:

    gcloud auth application-default login
    
  3. Compute Engine Service Account (when running on GCE)
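
To confirm which identity and project the server will pick up, you can resolve Application Default Credentials directly with the google-auth library. This is a minimal verification sketch; the file name is hypothetical and the script is not part of this server:

# check_credentials.py - verify what Application Default Credentials resolve to
import google.auth

# google.auth.default() follows the same lookup order as the Google Cloud
# client libraries: GOOGLE_APPLICATION_CREDENTIALS, then gcloud ADC, then
# the attached service account when running on GCE.
credentials, project_id = google.auth.default()
print(f"Resolved project: {project_id}")
print(f"Credential type: {type(credentials).__name__}")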

Required Permissions

Ensure your service account or user has the following IAM roles (a programmatic permission check is sketched after the list):

  • roles/dataproc.editor - For cluster and job management
  • roles/storage.objectViewer - For accessing job files in Cloud Storage
  • roles/compute.networkUser - For VPC network access (if using custom networks)
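
You can verify effective permissions up front with the Resource Manager testIamPermissions API. A hedged sketch using the google-cloud-resourcemanager library; the specific permission names checked here are assumptions drawn from the roles above, not a list published by this server:

# check_permissions.py - confirm key Dataproc permissions before starting
from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()
response = client.test_iam_permissions(
    resource="projects/my-project",
    permissions=[
        "dataproc.clusters.list",
        "dataproc.clusters.create",
        "dataproc.jobs.create",
    ],
)
# The response echoes back only the permissions the caller actually holds.
print("Granted:", list(response.permissions))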

Usage

Running the Server

First, activate your virtual environment (if using one):

source .venv/bin/activate

The server supports multiple transport protocols:

# STDIO (default) - for command-line tools and MCP clients
python -m dataproc_mcp_server

# HTTP - MCP over HTTP using the streamable-http transport
DATAPROC_MCP_TRANSPORT=http python -m dataproc_mcp_server

# SSE - Server-Sent Events for real-time communication
DATAPROC_MCP_TRANSPORT=sse python -m dataproc_mcp_server

# Run with entry point script (STDIO only)
dataproc-mcp-server

Transport Configuration

  • STDIO (default): Standard input/output communication for command-line tools and MCP clients
  • HTTP: MCP over HTTP using the streamable-http transport (a client sketch follows below)
    • Server URL: http://localhost:8000/mcp
    • Accessible via web clients and HTTP-based MCP clients
  • SSE: Server-Sent Events for real-time, server-to-client streaming (clients send requests via HTTP POST)
    • Server URL: http://localhost:8000/sse
    • Supports streaming responses and live updates
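
For the HTTP transport, any MCP-capable client can connect to the /mcp endpoint. A minimal sketch using the official mcp Python SDK (assumed to be installed separately; the URL matches the default host and port documented below):

# http_client_demo.py - connect to the streamable-http endpoint and list tools
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main():
    # streamablehttp_client yields read/write streams plus a session-ID getter
    # (unused here).
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])

asyncio.run(main())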

Environment Variables

# Transport type (stdio, http, sse)
export DATAPROC_MCP_TRANSPORT=http

# Server host (for HTTP/SSE transports)
export DATAPROC_MCP_HOST=0.0.0.0

# Server port (for HTTP/SSE transports)
export DATAPROC_MCP_PORT=8080

# Enable debug logging (true, 1, yes to enable)
export DATAPROC_MCP_DEBUG=true

# Authentication
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

MCP Client Configuration

Add to your MCP client configuration:

{
  "mcpServers": {
    "dataproc": {
      "command": "python",
      "args": ["-m", "dataproc_mcp_server"],
      "env": {
        "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json",
        "DATAPROC_MCP_DEBUG": "true"
      }
    }
  }
}

Testing with MCP Inspector

You can test the server using the official MCP Inspector:

# Test STDIO transport
npx @modelcontextprotocol/inspector python -m dataproc_mcp_server

# Test HTTP transport with debug logging
DATAPROC_MCP_TRANSPORT=http DATAPROC_MCP_DEBUG=true python -m dataproc_mcp_server &
npx @modelcontextprotocol/inspector --transport http --server-url http://127.0.0.1:8000/mcp

# Test SSE transport  
DATAPROC_MCP_TRANSPORT=sse python -m dataproc_mcp_server &
npx @modelcontextprotocol/inspector --transport sse --server-url http://127.0.0.1:8000/sse

The MCP Inspector provides a web interface to:

  • Browse available tools and resources
  • Test tool calls with custom parameters
  • View real-time protocol messages
  • Debug server responses

Example Tool Usage

Create a Cluster

{
  "name": "create_cluster",
  "arguments": {
    "project_id": "my-project",
    "region": "us-central1",
    "cluster_name": "my-cluster",
    "num_instances": 3,
    "machine_type": "n1-standard-4",
    "disk_size_gb": 100,
    "image_version": "2.1-debian11"
  }
}
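
Under the hood the server wraps the google-cloud-dataproc client library (see dataproc_client.py in the project structure). As a hedged sketch, the tool call above corresponds roughly to the following direct library call; the exact argument mapping is an assumption, not this server's code:

# create_cluster_direct.py - approximate library-level equivalent
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = client.create_cluster(
    project_id="my-project",
    region=region,
    cluster={
        "cluster_name": "my-cluster",
        "config": {
            "worker_config": {
                "num_instances": 3,
                "machine_type_uri": "n1-standard-4",
                "disk_config": {"boot_disk_size_gb": 100},
            },
            "software_config": {"image_version": "2.1-debian11"},
        },
    },
)
# create_cluster returns a long-running operation; block until it completes.
cluster = operation.result()
print(f"Cluster ready: {cluster.cluster_name}")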

Submit a PySpark Job

{
  "name": "submit_job",
  "arguments": {
    "project_id": "my-project",
    "region": "us-central1", 
    "cluster_name": "my-cluster",
    "job_type": "pyspark",
    "main_file": "gs://my-bucket/my-script.py",
    "args": ["--input", "gs://my-bucket/input", "--output", "gs://my-bucket/output"],
    "properties": {
      "spark.executor.memory": "4g",
      "spark.executor.instances": "3"
    }
  }
}
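
The library-level equivalent, again as an approximation rather than the server's actual code:

# submit_job_direct.py - approximate library-level equivalent
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/my-script.py",
        "args": ["--input", "gs://my-bucket/input", "--output", "gs://my-bucket/output"],
        "properties": {"spark.executor.memory": "4g", "spark.executor.instances": "3"},
    },
}

response = client.submit_job(project_id="my-project", region=region, job=job)
print(f"Submitted job: {response.reference.job_id}")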

Create a Batch Job

{
  "name": "create_batch_job",
  "arguments": {
    "project_id": "my-project",
    "region": "us-central1",
    "batch_id": "my-batch-job",
    "job_type": "pyspark",
    "main_file": "gs://my-bucket/batch-script.py",
    "service_account": "my-service-account@my-project.iam.gserviceaccount.com"
  }
}
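
And for serverless batches, a sketch of the rough equivalent with the BatchControllerClient:

# create_batch_direct.py - approximate library-level equivalent
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",
    batch_id="my-batch-job",
    batch={
        "pyspark_batch": {"main_python_file_uri": "gs://my-bucket/batch-script.py"},
        "environment_config": {
            "execution_config": {
                "service_account": "my-service-account@my-project.iam.gserviceaccount.com"
            }
        },
    },
)
# The long-running operation completes when the batch workload finishes.
batch = operation.result()
print(f"Batch state: {batch.state.name}")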

Development

Setup Development Environment

# Install development dependencies
uv pip install --system -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
python -m pytest --cov=src/dataproc_mcp_server tests/

# Run specific test file
pytest tests/test_dataproc_client.py -v

Code Quality

# Format code
ruff format src/ tests/

# Lint code
ruff check src/ tests/

# Type checking (with VS Code + Pylance or mypy)
mypy src/

Project Structure

dataproc-mcp/
├── src/dataproc_mcp_server/
│   ├── __init__.py
│   ├── __main__.py           # Entry point
│   ├── server.py             # MCP server implementation
│   ├── dataproc_client.py    # Dataproc cluster/job operations
│   └── batch_client.py       # Dataproc batch operations
├── tests/
│   ├── __init__.py
│   ├── test_server.py
│   └── test_dataproc_client.py
├── examples/
│   ├── mcp_server_config.json
│   └── example_usage.py
├── pyproject.toml
├── CLAUDE.md                 # Development guide
└── README.md

Troubleshooting

Common Issues

  1. Authentication Errors:

    • Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
    • Ensure service account has required permissions
    • Check that Dataproc API is enabled

  2. Network Errors:

    • Verify VPC/subnet configurations for custom networks
    • Check firewall rules for cluster communication
    • Ensure clusters are in the correct region

  3. Job Submission Failures:

    • Verify file paths in Cloud Storage are accessible
    • Check cluster has sufficient resources
    • Validate job configuration parameters

Debug Mode

Enable debug logging manually (alternatively, set DATAPROC_MCP_DEBUG=true as described under Environment Variables):

export PYTHONPATH=/path/to/dataproc-mcp/src
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from dataproc_mcp_server import __main__
import asyncio
asyncio.run(__main__.main())
"

API Reference

Tools

Cluster Management

  • list_clusters(project_id, region) - List all clusters
  • create_cluster(project_id, region, cluster_name, ...) - Create cluster
  • delete_cluster(project_id, region, cluster_name) - Delete cluster
  • get_cluster(project_id, region, cluster_name) - Get cluster details

Job Management

  • submit_job(project_id, region, cluster_name, job_type, main_file, ...) - Submit job
  • list_jobs(project_id, region, cluster_name?, job_states?) - List jobs
  • get_job(project_id, region, job_id) - Get job details
  • cancel_job(project_id, region, job_id) - Cancel job

Batch Operations

  • create_batch_job(project_id, region, batch_id, job_type, main_file, ...) - Create batch job
  • list_batch_jobs(project_id, region, page_size?) - List batch jobs
  • get_batch_job(project_id, region, batch_id) - Get batch job details
  • delete_batch_job(project_id, region, batch_id) - Delete batch job

Resources

  • dataproc://clusters - Access cluster information
  • dataproc://jobs - Access job information
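
Connecting as in the HTTP transport sketch under Usage, a tool call and a resource read look like this. A brief sketch; the argument values are placeholders:

# call_tool_demo.py - invoke a tool and read a resource over streamable HTTP
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main():
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Tool calls take the tool name plus a dict of arguments.
            result = await session.call_tool(
                "list_clusters",
                {"project_id": "my-project", "region": "us-central1"},
            )
            print(result.content)
            # Resources are addressed by URI.
            resource = await session.read_resource("dataproc://clusters")
            print(resource.contents)

asyncio.run(main())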

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite and linting
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review Google Cloud Dataproc documentation
  3. Open an issue in the repository