
Data GraphQL Agent

MCP (Model Context Protocol) agent that generates production-ready Apollo GraphQL servers from BigQuery SQL queries with Dataplex lineage tracking.

Features

  • 🚀 Auto-generate Apollo GraphQL Servers from BigQuery queries
  • 📊 BigQuery Integration with type inference from SQL schemas
  • 📝 Dataplex Lineage Tracking for end-to-end data governance
  • 🐳 Docker Support for containerized deployments
  • 🧪 Test Client Generation for API validation
  • 🔌 MCP Protocol for seamless integration with Cursor and other AI assistants

How It Works

End-to-End Flow

1. Input         →  2. Schema Inference  →  3. Code Generation  →  4. Validation  →  5. Output
BigQuery SQL        Dry-run Analysis        Jinja2 Templates       Multi-level      GCS/Local
Queries             Type Mapping            Apollo Server v4       Checks           Files

Detailed Steps:

  1. Input: You provide BigQuery SQL queries via the MCP tool
  2. Schema Inference: Agent runs BigQuery dry-run to infer result types
  3. Code Generation: Generates complete Apollo Server project with templates
  4. Validation (optional): Validates generated code at selected level
  5. Output: Writes validated code to GCS or local filesystem
  6. Deployment: You run the generated Node.js application
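
The steps above can be sketched end to end. The sketch below is illustrative only: all names are hypothetical, the inferred schema is supplied directly rather than coming from a BigQuery dry-run, and string.Template stands in for the project's Jinja2 templates.

```python
from string import Template

# Hypothetical stand-in for step 3 (code generation). The real agent
# renders a full Apollo Server project; this renders only typeDefs.
TYPE_DEFS = Template("""type $type_name {
$fields
}

type Query {
  $query_name: [$type_name!]!
}
""")

def generate_type_defs(query_name: str, schema: dict) -> str:
    """Render a GraphQL typeDefs string from an inferred column->type map."""
    type_name = query_name[0].upper() + query_name[1:]
    fields = "\n".join(f"  {col}: {gql}" for col, gql in schema.items())
    return TYPE_DEFS.substitute(
        type_name=type_name, fields=fields, query_name=query_name
    )

# Step 2 output (normally produced by a BigQuery dry-run):
inferred = {"item": "String", "total": "Float"}
print(generate_type_defs("trendingItems", inferred))
```

Steps 4 and 5 (validation and writing to GCS or disk) would then operate on the rendered files.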

Validation Levels

Choose validation thoroughness based on your needs:

Level      Time   Coverage   Checks                                                  Use Case
Quick      ~1s    80%        GraphQL syntax, SQL dry-run, file structure             Rapid iteration, development
Standard   ~10s   95%        Quick + TypeScript compilation, imports                 Default, balanced approach
Full       ~60s   99%        Standard + Docker build, server startup, health check   Pre-production, CI/CD

Architecture

The agent generates a complete TypeScript/Node.js project with:

  • Apollo Server v4 - GraphQL API server with plugins and context
  • Type-safe resolvers - Auto-generated from BigQuery schemas
  • Dataplex integration - Runtime lineage event tracking
  • Error handling - Production-safe error formatting
  • Docker configuration - Multi-stage builds for production
  • Test suite - Integration tests and test client

Installation

Prerequisites

  • Python 3.10-3.12
  • Poetry (Python dependency management)
  • Google Cloud account with BigQuery access

Setup

# Clone the repository
git clone https://github.com/opendedup/data-graphql-agent.git
cd data-graphql-agent

# Install dependencies
poetry install

# Configure environment variables
cp .env.example .env
# Edit .env with your GCP credentials

Configuration

Create a .env file or set environment variables:

# GCP Configuration
GCP_PROJECT_ID=your-project-id
GCP_LOCATION=us-central1

# Output Configuration
GRAPHQL_OUTPUT_DIR=gs://your-bucket/graphql-server
# Or local path: GRAPHQL_OUTPUT_DIR=/path/to/output

# MCP Server Configuration
MCP_TRANSPORT=stdio  # or http
MCP_HOST=0.0.0.0
MCP_PORT=8080
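
A minimal sketch of reading this configuration in Python (variable names and defaults as documented above; the loading function itself is illustrative, not the agent's actual code):

```python
import os

def load_config() -> dict:
    """Read the agent's settings from the environment, with the
    defaults documented above."""
    return {
        "project_id": os.environ.get("GCP_PROJECT_ID"),
        "location": os.environ.get("GCP_LOCATION", "us-central1"),
        "output_dir": os.environ.get("GRAPHQL_OUTPUT_DIR", "./output"),
        "transport": os.environ.get("MCP_TRANSPORT", "stdio"),
        "host": os.environ.get("MCP_HOST", "0.0.0.0"),
        "port": int(os.environ.get("MCP_PORT", "8080")),
    }

config = load_config()
```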

Usage

As MCP Server (Recommended)

Configure in Cursor's mcp.json:

{
  "mcpServers": {
    "data-graphql-agent": {
      "command": "poetry",
      "args": ["run", "python", "-m", "data_graphql_agent.mcp"],
      "cwd": "/path/to/data-graphql-agent",
      "env": {
        "GCP_PROJECT_ID": "your-project",
        "GRAPHQL_OUTPUT_DIR": "gs://your-bucket/graphql-server"
      }
    }
  }
}

Direct Python Usage

from data_graphql_agent.generation import ProjectGenerator
from data_graphql_agent.clients import StorageClient
from data_graphql_agent.models import QueryInput

# Define queries
queries = [
    QueryInput(
        query_name="trendingItems",
        sql="SELECT item, SUM(sales) as total FROM `project.dataset.sales` GROUP BY item",
        source_tables=["project.dataset.sales"]
    )
]

# Generate project
generator = ProjectGenerator(project_id="your-project")
files = generator.generate_project("my-project", queries)

# Write to storage
storage = StorageClient(project_id="your-project")
manifests = storage.write_files("gs://bucket/output", files)

Running as HTTP Server

# Set transport to HTTP
export MCP_TRANSPORT=http
export MCP_PORT=8080

# Start server
poetry run python -m data_graphql_agent.mcp

Then call tools via HTTP:

curl -X POST http://localhost:8080/mcp/call-tool \
  -H "Content-Type: application/json" \
  -d '{
    "name": "generate_graphql_api",
    "arguments": {
      "queries": [...],
      "project_name": "my-project"
    }
  }'
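
The same call can be driven from Python. The sketch below only builds the JSON body shown in the curl example (the query object is illustrative); sending it is left to the HTTP client of your choice.

```python
import json

# Build the call-tool payload from the curl example above.
payload = {
    "name": "generate_graphql_api",
    "arguments": {
        "queries": [
            {
                "queryName": "trendingItems",
                "sql": "SELECT item, SUM(sales) AS total FROM `project.dataset.sales` GROUP BY item",
                "source_tables": ["project.dataset.sales"],
            }
        ],
        "project_name": "my-project",
    },
}

body = json.dumps(payload)
# POST `body` to http://localhost:8080/mcp/call-tool with
# Content-Type: application/json (e.g. via urllib.request or requests).
```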

MCP Tools

generate_graphql_api

Generates a complete Apollo GraphQL Server project with validation.

Input:

  • queries: Array of query objects with queryName, sql, and source_tables
  • project_name: Project name for lineage tracking
  • output_path: Optional output location (defaults to GRAPHQL_OUTPUT_DIR)
  • validation_level: Optional validation thoroughness - "quick", "standard" (default), or "full"
  • auto_fix: Optional boolean to attempt automatic error fixes (default: false)

Output:

  • Complete TypeScript/Node.js project
  • Docker configuration
  • Test client
  • Integration tests
  • Validation results with checks passed and warnings

Example with Validation:

result = await handle_generate_graphql_api({
    "queries": [
        {
            "queryName": "salesByRegion",
            "sql": "SELECT region, SUM(amount) as total FROM `project.dataset.sales` GROUP BY region",
            "source_tables": ["project.dataset.sales"]
        }
    ],
    "project_name": "analytics-api",
    "output_path": "./output",
    "validation_level": "standard",  # Quick validation for speed
    "auto_fix": false
})

Success Response:

{
  "success": true,
  "output_path": "./output",
  "files_generated": [...],
  "message": "Successfully generated and validated Apollo GraphQL Server with 1 queries. Generated 15 files at ./output. Validation: 5 checks passed in 8.2s"
}

Validation Failure Response:

{
  "success": false,
  "output_path": "./output",
  "files_generated": [],
  "message": "Code validation failed at standard level",
  "error": "Validation errors: Invalid SQL in query 'salesByRegion': Table not found; TypeScript compilation failed"
}

validate_graphql_schema

Validates a GraphQL schema file.

Input:

  • schema_path: Path to schema file

Output:

  • Validation results with errors and warnings

Generated Project Structure

graphql-server/
├── src/
│   ├── server.ts          # Main Apollo Server
│   ├── typeDefs.ts        # GraphQL schema
│   ├── resolvers.ts       # Query resolvers
│   └── lineage.ts         # Dataplex integration
├── test-client/           # Test client
├── tests/                 # Integration tests
├── package.json
├── tsconfig.json
├── Dockerfile
└── docker-compose.yml

Running Generated Server

cd output/graphql-server

# Install dependencies
npm install

# Development mode
npm run dev

# Production build
npm run build
npm start

# Docker
docker-compose up --build

Development

Running Tests

# Run all tests
poetry run pytest

# Run unit tests only
poetry run pytest tests/unit

# Run with coverage
poetry run pytest --cov=data_graphql_agent

Code Formatting

# Format with Black
poetry run black src tests

# Lint with Ruff
poetry run ruff check src tests

BigQuery Type Mapping

The agent automatically maps BigQuery types to GraphQL types:

BigQuery Type     GraphQL Type
STRING            String
INT64             Int
FLOAT64           Float
BOOL              Boolean
TIMESTAMP/DATE    String (ISO 8601)
STRUCT            Custom Object Type
ARRAY             [Type]

Nested structures (STRUCTs and ARRAYs) are fully supported with automatic type generation.
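
A simplified, hypothetical sketch of this mapping, including the ARRAY and STRUCT cases (the real agent derives field descriptions from dry-run schemas):

```python
# Illustrative mapping from BigQuery field descriptions to GraphQL types.
SCALARS = {
    "STRING": "String",
    "INT64": "Int",
    "FLOAT64": "Float",
    "BOOL": "Boolean",
    "TIMESTAMP": "String",  # serialized as ISO 8601
    "DATE": "String",       # serialized as ISO 8601
}

def to_graphql(field: dict) -> str:
    """Map a field like {"name": "total", "type": "INT64"} to a
    GraphQL type string. REPEATED mode becomes a list; STRUCT fields
    reference a generated object type named after the field."""
    if field.get("mode") == "REPEATED":
        inner = to_graphql({**field, "mode": "NULLABLE"})
        return f"[{inner}]"
    if field["type"] == "STRUCT":
        # A custom object type would be generated for the nested fields.
        return field["name"].capitalize()
    return SCALARS[field["type"]]

print(to_graphql({"name": "total", "type": "INT64"}))                      # Int
print(to_graphql({"name": "tags", "type": "STRING", "mode": "REPEATED"}))  # [String]
```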

Validation Benefits

Why Validate Before Writing?

  1. Catch errors early - Invalid SQL, type mismatches, and syntax errors detected before deployment
  2. Faster iteration - No manual debugging of generated code
  3. Confidence - Know your code will work before running npm install
  4. Cost savings - Avoid wasted GCS writes and Docker builds for broken code
  5. CI/CD friendly - Use full validation in pipelines for guaranteed deployments

When to Use Which Level?

Quick Validation (~1s)

  • ✅ Rapid prototyping and experimentation
  • ✅ Iterating on SQL queries
  • ✅ Testing query-to-schema mappings
  • ❌ Not for production deployments

Standard Validation (~10s) - Recommended Default

  • ✅ Normal development workflow
  • ✅ Before committing to version control
  • ✅ Balanced speed and thoroughness
  • ✅ Most common use case

Full Validation (~60s)

  • ✅ Pre-production deployments
  • ✅ CI/CD pipelines
  • ✅ Critical production updates
  • ✅ When Docker compatibility is essential
  • ❌ Too slow for rapid iteration

Data Lineage

The generated GraphQL server automatically tracks data lineage in Google Cloud Dataplex:

  • Process: Each resolver is registered as a process
  • Run: Each query execution creates a run (with unique request ID)
  • Lineage Events: Link BigQuery sources to BI report targets
  • Cleanup: Graceful shutdown removes lineage processes

Lineage operations are asynchronous (fire-and-forget) and don't block API responses.
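
The fire-and-forget pattern can be illustrated with asyncio (emit_lineage_event is a stand-in; the generated server calls the Dataplex lineage client instead):

```python
import asyncio

recorded = []

async def emit_lineage_event(run_id: str) -> None:
    # Stand-in for the Dataplex call that links BigQuery sources
    # to the report target for this run.
    await asyncio.sleep(0.01)
    recorded.append(run_id)

async def resolve_query(run_id: str) -> dict:
    # Fire-and-forget: schedule the lineage event without awaiting it,
    # so the API response is not blocked by lineage tracking.
    asyncio.get_running_loop().create_task(emit_lineage_event(run_id))
    return {"data": "query result"}

async def main() -> None:
    result = await resolve_query("run-123")
    print(result["data"])      # returned before the lineage event lands
    await asyncio.sleep(0.05)  # give the background task time to finish
    print(recorded)

asyncio.run(main())
```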

License

Apache 2.0 - See the LICENSE file for details.

Contributing

Contributions are welcome! Please submit pull requests or open issues for bugs and feature requests.