data-mcp-server

stancsz/data-mcp-server

data-mcp is an MCP server built with the fastmcp framework, providing a secure bridge for AI agents to interact with AWS and GCP data platforms.

data-mcp

data-mcp is an AI-first control plane built on the Model Context Protocol (MCP) that enables agentic systems to securely discover, access, transform, and deploy data-driven applications across cloud providers (AWS, GCP). It combines data connectors, infrastructure-as-code, and GitOps to let intelligent agents design, provision, and operate production-ready data pipelines and services while preserving governance and auditability.

Why it matters:

  • Accelerates platform engineering by reducing time-to-production for data applications and ML services.
  • Enables agents to autonomously generate and deploy infrastructure (Terraform) and application manifests (Helm/ArgoCD) with policy and approval gates.
  • Preserves enterprise controls: service-account based auth, least-privilege operations, audit trails, and policy-as-code.
  • Lowers operational overhead and enables repeatable, versioned, Git-driven deployments for data pipelines and full-stack AI prototypes.

High-level architecture (mermaid)

%%{init: {'theme':'base', 'themeVariables': {'fontSize':'16px'}, 'flowchart': {'nodeSpacing': 80, 'rankSpacing': 80}}}%%
flowchart LR
  %% Agentic system
  subgraph AgentSystem["Agentic System"]
    AGENT["AI Agent\n(agentic devops)"]
  end

  %% Control plane
  subgraph MCPCluster["data-mcp Control Plane"]
    MCP["data-mcp Server"]
    subgraph Connectors["Connectors"]
      S3["S3"]
      DDB["DynamoDB"]
      BQ["BigQuery"]
    end
    TF["Terraform Modules"]
    GitRepo["Git Repo\n(Helm / Manifests)"]
  end

  %% GitOps & infra
  subgraph GitOps["GitOps / Deployment"]
    ArgoCD["ArgoCD\n(GitOps)"]
  end

  subgraph CloudInfra["Cloud Infrastructure"]
    CLOUD["Cloud Infra\n(AWS / GCP)"]
  end

  %% Kubernetes / runtime
  subgraph Kubernetes["Kubernetes Cluster"]
    K8S["Kubernetes\n(Clusters / Namespaces)"]
    APP["Data Pipelines / ML Services"]
  end

  OBS["Observability\n(Logging / Metrics / Tracing)"]

  %% Flows
  AGENT -->|MCP API| MCP
  MCP -->|read / write| S3
  MCP -->|read / write| DDB
  MCP -->|query| BQ
  MCP -->|generate infra| TF
  TF -->|provision| CLOUD
  MCP -->|commit charts & manifests| GitRepo
  GitRepo -->|reconcile| ArgoCD
  ArgoCD -->|deploy| K8S
  K8S -->|run| APP
  APP -->|emit| OBS

  %% Styling for clarity
  style MCP fill:#f9f,stroke:#333,stroke-width:1px
  style TF fill:#ffefb3,stroke:#333,stroke-width:1px
  style AGENT fill:#e6f7ff,stroke:#333,stroke-width:1px
  style Connectors fill:#ffffff,stroke:#888,stroke-width:1px
  style GitRepo fill:#fff7e6,stroke:#333,stroke-width:1px
  style ArgoCD fill:#f0fff0,stroke:#333,stroke-width:1px
  style K8S fill:#f7f7ff,stroke:#333,stroke-width:1px
  style OBS fill:#fff,stroke:#666,stroke-width:1px

Key goals:

  • Securely expose AWS and GCP data sources to agentic systems via MCP.
  • Use service-account based authentication to enable role/attribute-based access control.
  • Provide built-in agentic capabilities: probe caching (Redis), multi-source data retrieval, transformation and onward delivery.
  • Support common enterprise data systems: S3, Redshift, DynamoDB, GCS, BigQuery.

Table of contents

  • Features
  • Architecture overview
  • Supported connectors
  • Security & access control (Service Accounts, RBAC / ABAC)
  • Agentic capabilities (Redis caching, probe data, pipelines)
  • Quick start
  • Configuration examples
  • Usage patterns & examples
  • Development notes
  • Contributing
  • License

Features

  • FastMCP-based MCP server for delivering data access functionality to AI agents.
  • Connectors for:
    • AWS: S3, Redshift, DynamoDB
    • GCP: GCS, BigQuery
  • Service Account control:
    • Agents authenticate using service account credentials (SA).
    • The MCP server relays SA credentials when authorized, enabling secure, role-based and attribute-based access.
    • Enterprise-grade control model for least-privilege access.
  • Agentic abilities:
    • Redis-backed caching for probe results, reducing repeated expensive fetches.
    • Ability to probe multiple sources, merge/transform results, and write to a specified target.
    • Pluggable pipeline stages: fetch -> process -> transform -> deliver.
  • Audit-friendly: logs and access patterns designed for auditability in enterprise environments.

Architecture overview

  1. Agent (client) connects to the MCP server via the fastmcp protocol.
  2. Agent authenticates using a service account credential (SA token/JSON/key).
  3. MCP validates the SA against local policies and maps it to allowed roles/attributes.
  4. Agent requests data access or pipeline execution.
  5. MCP:
    • Optionally checks Redis cache for recent probe results.
    • If needed, pulls data from one or more configured sources (e.g., S3 + BigQuery).
    • Runs processing/transform steps (user-defined or built-in).
    • Writes/transfers the result to a requested destination (S3/GCS/etc.) if authorized.
  6. MCP returns structured results and stores audit logs.

A high-level diagram (textual): Agent -> fastmcp -> data-mcp server -> [Auth/Policy Engine] -> [Redis cache] -> [Connectors: S3, Redshift, DynamoDB, GCS, BigQuery] -> [Processing/Transforms] -> [Destination]
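
The same flow as a minimal Python sketch. Every name here (handle_probe, policy_engine, cache, connectors, processors, audit_log) is a hypothetical placeholder used only to illustrate the ordering of the steps above, not the project's actual API.

# Hypothetical orchestration of steps 2-6 above. Every collaborator passed in
# (policy_engine, cache, connectors, processors, audit_log) is illustrative.
def handle_probe(request, policy_engine, cache, connectors, processors, audit_log):
    # Steps 2-3: validate the SA credential and resolve its roles/attributes.
    identity = policy_engine.authenticate(request["auth"]["service_account"])
    policy_engine.authorize(identity, request["sources"], request.get("destination"))

    # Step 5: serve from the Redis-backed probe cache when possible.
    key = cache.key_for(request["sources"], request.get("processors", []))
    cached = cache.get(key)
    if cached is not None:
        audit_log.record(identity, action="probe_cache_hit", key=key)
        return cached

    # Step 5 (cont.): fetch from each configured source, e.g. S3 + BigQuery.
    data = [connectors[s["type"]].fetch(s, identity) for s in request["sources"]]

    # Step 5 (cont.): run the user-defined or built-in processing pipeline.
    for step in request.get("processors", []):
        data = processors[step["name"]](data, **step.get("params", {}))

    # Step 5 (cont.): deliver to the requested destination, if authorized above.
    if request.get("destination"):
        connectors[request["destination"]["type"]].write(request["destination"], data, identity)

    # Step 6: cache the probe result, audit, and return structured output.
    cache.set(key, data)
    audit_log.record(identity, action="probe", key=key, sources=request["sources"])
    return data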

Supported connectors

  • AWS
    • S3: object storage access
    • Redshift: data warehouse queries & exports
    • DynamoDB: key/value and document data access
  • GCP
    • GCS: object storage access
    • BigQuery: analytics queries & exports

Connectors are implemented as modular adapters so adding other data systems should be straightforward.

Security & Access Control

  • Service Accounts (SA) are the primary authentication mechanism.
    • Agents present SA credentials to the MCP server.
    • MCP validates and, if authorized, uses or relays SA credentials to access cloud resources on behalf of the agent.
  • Role-Based Access Control (RBAC)
    • Map SAs to roles that define allowed data operations and resources.
  • Attribute-Based Access Control (ABAC)
    • Policies may consider attributes like environment, project, agent id, time-of-day, and resource tags.
  • Principle of least privilege
    • The MCP server aims to only allow resource access consistent with the SA’s effective permissions and local policy.
  • Auditing & logging
    • All access requests, relayed credential usage, and pipeline operations should be logged for compliance.
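
As an illustration of how RBAC and ABAC can combine, the sketch below evaluates a request against a role table and the SA's attributes. The ServiceAccount class, role table, and tag-matching rule are assumptions for this example; the policy data mirrors the sa-data-analyst entry in the configuration section further down.

# Illustrative RBAC + ABAC check; not the project's actual policy engine.
from dataclasses import dataclass, field

@dataclass
class ServiceAccount:
    id: str
    roles: list
    attributes: dict = field(default_factory=dict)

# Role -> allowed (connector, operation) pairs. Purely illustrative.
ROLE_PERMISSIONS = {
    "read-only": {("s3", "fetch"), ("gcs", "fetch"), ("bigquery", "query")},
    "pipeline-operator": {("s3", "fetch"), ("s3", "write"), ("gcs", "write"),
                          ("bigquery", "query"), ("redshift", "query")},
}

def is_allowed(sa: ServiceAccount, connector: str, operation: str, resource_tags: dict) -> bool:
    # RBAC: at least one of the SA's roles must grant this connector/operation.
    rbac_ok = any((connector, operation) in ROLE_PERMISSIONS.get(role, set())
                  for role in sa.roles)
    # ABAC: every tag on the resource must match the SA's attributes
    # (e.g. an analytics SA may only touch resources tagged department=analytics).
    abac_ok = all(sa.attributes.get(key) == value for key, value in resource_tags.items())
    return rbac_ok and abac_ok

# The sa-data-analyst account from the sample config below.
analyst = ServiceAccount(id="sa-data-analyst", roles=["read-only"],
                         attributes={"department": "analytics"})
assert is_allowed(analyst, "s3", "fetch", {"department": "analytics"})
assert not is_allowed(analyst, "s3", "write", {"department": "analytics"})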

Agentic capabilities

  • Redis caching:
    • Probe results can be cached in Redis to reduce duplicate expensive reads.
    • Cache keys are derived from probe parameters (sources, query, credentials scope).
  • Multi-source probes:
    • The server can fetch from two or more sources, merge or stitch datasets, then process them (join, aggregate, filter).
  • Processing & Transforms:
    • Built-in transforms (filter, aggregation, format change) and a plugin hook for custom processors.
  • Delivery:
    • After processing, results can be sent to a target destination chosen by the agent (S3, GCS, or other sinks) given authorization.

Example flow:

  1. Agent requests a combined probe: S3 (parquet) + BigQuery (query) for a specified time range.
  2. MCP checks Redis for cached result.
  3. If not cached, MCP pulls data from both sources, runs a processing pipeline (sample normalization + aggregation), stores the output to a destination (GCS), and caches the probe result.
  4. MCP returns a response referencing the destination location and metadata.
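
A sketch of how the Redis-backed probe cache in steps 2-3 might be keyed and consulted, assuming the redis-py client; the key layout and the fetch_and_process callback are illustrative assumptions.

# Illustrative Redis-backed probe cache (redis-py). Key layout is an assumption.
import hashlib
import json

import redis

cache = redis.Redis.from_url("redis://localhost:6379/0")
CACHE_TTL_SECONDS = 3600

def probe_cache_key(sources, processors, sa_id):
    # Cover everything that changes the result: sources/queries, the processor
    # chain, and the identity (credentials scope) used to fetch.
    payload = json.dumps({"sources": sources, "processors": processors, "sa": sa_id},
                         sort_keys=True)
    return "probe:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_probe(sources, processors, sa_id, fetch_and_process):
    key = probe_cache_key(sources, processors, sa_id)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                       # cache hit: skip cloud reads
    result = fetch_and_process(sources, processors)  # cache miss: do the real work
    cache.set(key, json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result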

Quick start (overview)

Prerequisites:

  • Python 3.8+ (check the repository for the exact runtime requirement)
  • Access to AWS and/or GCP with appropriate service account credentials.
  • Redis instance for caching (optional but recommended).
  • fastmcp installed (this project is built on top of fastmcp).

Basic steps:

  1. Clone this repo: git clone https://github.com/stancsz/data-mcp-server.git
  2. Configure credentials (see next section).
  3. Configure service accounts, policies, and connector settings.
  4. Start the MCP server (example):
    • The exact command depends on the project layout (e.g., python -m data_mcp.server or uvicorn data_mcp.app:app --reload). Check the ./bin or ./scripts folder or the project’s main module.
  5. Connect an agent using fastmcp protocol and authenticate with an SA credential.

Note: This README provides the conceptual setup. See the configuration examples below and the repo's config files for concrete examples.
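
For orientation, a minimal FastMCP server exposing a single probe tool could look like the sketch below. It illustrates the fastmcp decorator style only; the repository's real entry point, tool names, and signatures may differ.

# Hypothetical minimal entry point; check the repository for the real one.
from typing import Optional

from fastmcp import FastMCP

mcp = FastMCP("data-mcp")

@mcp.tool()
def run_probe(sources: list, processors: Optional[list] = None,
              destination: Optional[dict] = None) -> dict:
    """Fetch from the configured sources, apply processors, optionally deliver."""
    # ... authenticate the SA, check Redis, fetch via connectors, transform,
    #     deliver, and write audit logs (see the sections above) ...
    return {"status": "ok", "destination": destination}

if __name__ == "__main__":
    mcp.run()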

Configuration examples

Environment variables (common):

  • AWS:
    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
    • AWS_SESSION_TOKEN (optional)
    • AWS_REGION
  • GCP:
    • GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
  • Redis:
    • REDIS_URL=redis://:password@host:6379/0
  • MCP:
    • MCP_PORT=your_port
    • MCP_HOST=0.0.0.0

Sample config (YAML) - connectors and policies (example snippet):

connectors:
  s3:
    enabled: true
    default_bucket: my-tenant-bucket
  bigquery:
    enabled: true
    project: my-gcp-project
  redshift:
    enabled: true
  dynamodb:
    enabled: true
  gcs:
    enabled: true

auth:
  service_accounts:
    - id: sa-data-analyst
      allowed_roles: [read-only]
      attributes:
        department: analytics

cache:
  redis_url: redis://localhost:6379/0
  ttl_seconds: 3600

pipeline:
  default_processors:
    - normalize
    - deduplicate
    - aggregate
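
A short sketch of how a config file like the one above might be loaded and sanity-checked at start-up, assuming PyYAML and the section names from the example; the loader itself is illustrative.

# Illustrative loader for a config file shaped like the example above (PyYAML).
import os

import yaml

def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh)
    # Minimal sanity checks on the sections used in the example above.
    for section in ("connectors", "auth", "cache", "pipeline"):
        if section not in cfg:
            raise ValueError(f"missing '{section}' section in {path}")
    # Environment variables (e.g. REDIS_URL) may override file values.
    cfg["cache"]["redis_url"] = os.getenv("REDIS_URL", cfg["cache"].get("redis_url"))
    return cfg

if __name__ == "__main__":
    config = load_config()
    enabled = [name for name, c in config["connectors"].items() if c.get("enabled")]
    print("enabled connectors:", enabled)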

Service account onboarding:

  • Upload or register SA credential(s) with the MCP server (secure secret store).
  • Assign roles and attribute mappings to the SA.
  • Define per-SA or per-role policies controlling:
    • which connectors may be used
    • which buckets/tables/datasets are accessible
    • whether delivery to external sinks is allowed

Usage patterns & examples

  • Data probe:
    • Agent requests a probe; MCP returns either cached result or initiates a fetch & process.
  • Cross-source join:
    • Pull data from S3 (daily files) and BigQuery (reference table), join in-memory, and write output to GCS.
  • Export pipeline:
    • Agent triggers a pipeline that extracts data from Redshift, transforms it, and writes CSVs to S3 for downstream systems.

Example (pseudocode) request payload

{
  "action": "run_probe",
  "auth": {
    "service_account": "base64-or-path-or-token"
  },
  "sources": [
    { "type": "s3", "bucket": "tenant-bucket", "path": "2025/09/15/*.parquet", "format": "parquet" },
    { "type": "bigquery", "project": "my-project", "query": "SELECT * FROM dataset.table WHERE date = '2025-09-15'" }
  ],
  "processors": [
    { "name": "filter", "params": { "column": "status", "value": "active" } },
    { "name": "aggregate", "params": { "group_by": ["country"], "metrics": ["sum(amount)"] } }
  ],
  "destination": { "type": "gcs", "bucket": "results-bucket", "path": "agent-outputs/2025-09-15/" }
}
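
A sketch of how an agent could submit that payload over MCP, assuming the fastmcp Client API, a server reachable at a placeholder URL, and a tool named run_probe (the payload's action field mapped to the tool name).

# Hypothetical agent-side call; the URL and tool name are placeholders.
import asyncio

from fastmcp import Client

async def main():
    async with Client("http://localhost:8000/mcp") as client:
        result = await client.call_tool("run_probe", {
            "auth": {"service_account": "base64-or-path-or-token"},
            "sources": [
                {"type": "s3", "bucket": "tenant-bucket",
                 "path": "2025/09/15/*.parquet", "format": "parquet"},
                {"type": "bigquery", "project": "my-project",
                 "query": "SELECT * FROM dataset.table WHERE date = '2025-09-15'"},
            ],
            "processors": [
                {"name": "filter", "params": {"column": "status", "value": "active"}},
                {"name": "aggregate", "params": {"group_by": ["country"],
                                                 "metrics": ["sum(amount)"]}},
            ],
            "destination": {"type": "gcs", "bucket": "results-bucket",
                            "path": "agent-outputs/2025-09-15/"},
        })
        print(result)

asyncio.run(main())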

Development notes

  • Connector adapters are modular (a sketch follows this list). To add a new connector:
    • Implement a connector class with the standard fetch/query/write interface.
    • Register the connector in the server's connector registry.
  • Processor hooks:
    • Processors are stateless functions that accept input data and parameters and return transformed data.
  • Caching:
    • Cache keys should consider source signature, query text, credentials identity, processors list, and any relevant parameters.
  • Auditing:
    • All actions must include metadata (agent id, SA id, timestamp, target resources) and be logged to a secure append-only store (or exported to your SIEM).
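
A sketch of the connector interface, registry, and a processor hook described above; the base class, decorator, and S3Connector stub are illustrative assumptions, not the repository's actual classes.

# Illustrative connector interface, registry, and processor hook; the repo's
# actual base class and registration mechanism may differ.
from abc import ABC, abstractmethod

class Connector(ABC):
    """The standard fetch/query/write interface mentioned above."""

    @abstractmethod
    def fetch(self, source: dict, identity) -> list:
        """Read records described by `source` on behalf of `identity`."""

    @abstractmethod
    def write(self, destination: dict, records: list, identity) -> str:
        """Write processed records and return the destination location."""

CONNECTOR_REGISTRY = {}

def register_connector(name):
    def decorator(cls):
        CONNECTOR_REGISTRY[name] = cls
        return cls
    return decorator

@register_connector("s3")
class S3Connector(Connector):
    def fetch(self, source, identity):
        # e.g. boto3 calls scoped to the relayed SA credentials
        return []

    def write(self, destination, records, identity):
        return f"s3://{destination['bucket']}/{destination['path']}"

# Processor hook: a stateless function taking data plus parameters.
def filter_processor(records, column, value):
    return [r for r in records if r.get(column) == value]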

Testing & CI

  • Unit tests should mock cloud connectors and Redis.
  • Integration tests can be run against LocalStack (for AWS) and a BigQuery emulator or a test GCP project (for GCP).
  • CI should validate policy enforcement scenarios (RBAC/ABAC) to ensure no privilege escalation paths.
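
A self-contained example of the unit-test pattern above: the connector and Redis client are MagicMock stand-ins, and the tiny probe() helper stands in for the server's real orchestration, which the actual tests would import instead.

# Minimal, self-contained illustration of the mocking pattern (pytest). The
# connector and Redis client are MagicMock stand-ins; probe() below stands in
# for the server's real orchestration, which the actual tests would import.
from unittest.mock import MagicMock

def probe(cache, connector, key, source):
    cached = cache.get(key)
    if cached is not None:
        return cached
    return connector.fetch(source)

def test_cache_hit_skips_cloud_connector():
    cache, s3 = MagicMock(), MagicMock()
    cache.get.return_value = {"rows": 42}          # simulated Redis hit
    result = probe(cache, s3, "probe:abc123", {"bucket": "tenant-bucket"})
    assert result == {"rows": 42}
    s3.fetch.assert_not_called()                   # no real cloud access

def test_cache_miss_fetches_from_connector():
    cache, s3 = MagicMock(), MagicMock()
    cache.get.return_value = None                  # simulated Redis miss
    s3.fetch.return_value = {"rows": 7}
    assert probe(cache, s3, "probe:abc123", {"bucket": "tenant-bucket"}) == {"rows": 7}
    s3.fetch.assert_called_once()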

Contributing

  • Fork the repo and submit PRs for bug fixes, new connectors, or processor modules.
  • Ensure tests accompany new features.
  • Follow semantic commit messages and include CHANGELOG entries for user-facing changes.

Security considerations

  • Do not store SA credentials in source control.
  • Use a secure secrets manager for stored credentials (e.g., AWS Secrets Manager, GCP Secret Manager, Vault).
  • Ensure encrypted transit (TLS) for all MCP and connector communications.
  • Apply the least privilege principle for any transient credentials issued or relayed by the MCP server.

Roadmap / future ideas

  • Add fine-grained data masking and field-level access control.
  • Introduce connectors for additional cloud and on-prem data warehouses.
  • Add an audit UI for compliance teams to review agent activity and data flows.
  • Implement policy templates for common enterprise compliance regimes (SOC2, HIPAA, GDPR).
  • Integrate with ArgoCD for GitOps deployments of Kubernetes manifests and Helm charts.
    • The MCP server will be able to author, update, and push Helm charts and Kubernetes manifests to a Git repository watched by ArgoCD.
    • ArgoCD will continuously reconcile the Git repo to the target Kubernetes cluster, enabling automated deployment of data pipelines and services.
  • Provide Terraform automation hooks for provisioning cloud infrastructure (AWS/GCP) programmatically.
    • Agents can generate Terraform plans/modules that the MCP server can apply (via a controlled runner) or commit to a repo for CI/CD.
    • The server will include opinionated Terraform templates for common infra (VPCs, EKS/GKE clusters, IAM roles, S3/Dynamo/GCS, managed databases).
  • End-to-end deployment capability:
    • The MCP server is capable of provisioning required infra, building and packaging pipeline components (containers/Helm charts), and deploying them to Kubernetes using ArgoCD.
    • The MCP server can install the necessary resources (service accounts, RBAC bindings, ArgoCD app manifests, Terraform state backends) to enable seamless agent-driven deployments.
  • Roadmap priorities:
    1. Harden ArgoCD integration: secure Git credentials, app lifecycle management, audit trails for deployments.
    2. Terraform runner & safe apply: interactive approvals, plan inspection, and policy gates (a sketch follows this list).
    3. Templates & blueprints: ready-made pipeline, ML-serving, and full-stack app blueprints.
    4. Observability: bundle logging, metrics, and tracing for deployed pipelines and infra changes.
    5. Compliance: policy-as-code integration (e.g., OPA/Gatekeeper) to prevent insecure deployments.
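
A sketch of what the "safe apply" runner in roadmap item 2 could look like, driving the standard Terraform CLI through subprocess with a plan-inspection and approval gate; the module path and approval callback are placeholders.

# Hypothetical "safe apply" runner: plan, inspect, gate, then apply, using the
# standard Terraform CLI via subprocess. Module path and gate are placeholders.
import json
import subprocess

def terraform(args, cwd):
    return subprocess.run(["terraform", *args], cwd=cwd, check=True,
                          capture_output=True, text=True)

def safe_apply(module_dir, approve):
    terraform(["init", "-input=false"], module_dir)
    terraform(["plan", "-input=false", "-out=tfplan"], module_dir)

    # Inspect the machine-readable plan before anything changes.
    plan = json.loads(terraform(["show", "-json", "tfplan"], module_dir).stdout)
    changes = [rc for rc in plan.get("resource_changes", [])
               if rc["change"]["actions"] != ["no-op"]]

    # Approval / policy gate: could be an interactive prompt, an OPA policy
    # evaluation, or an agent-facing confirmation step.
    if not approve(changes):
        raise RuntimeError("terraform apply rejected by the approval gate")

    terraform(["apply", "-input=false", "tfplan"], module_dir)

# Example gate: a human reviews the change count and confirms.
safe_apply("./infra/example", approve=lambda ch: input(f"{len(ch)} changes. Apply? ") == "yes")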

Endgame / Project Purpose

This project's long-term goal is to enable AI agents to rapidly design, provision, and ship production-grade data applications. The "data-mcp" server is intended to be the control plane that allows an agent to:

  • Discover and pull data from sources (S3, DynamoDB, BigQuery, etc.).
  • Assemble data pipelines (transformations, joins, ML features).
  • Provision infrastructure as code (Terraform) and deploy applications/pipelines to Kubernetes (Helm / ArgoCD).
  • Manage lifecycle and audits for deployments, ensuring least-privilege operations and traceability.

Agent Examples & Diagrams

Below are example workflows and Mermaid diagrams showing how an AI agent can use this MCP server to build and deploy data apps.

Example 1 — Data pipeline creation + Kubernetes deployment (ArgoCD)

  • Agent pulls raw data from DynamoDB and S3.
  • Agent generates a pipeline (container + Helm chart) that performs ETL and writes back results.
  • Agent commits the Helm chart and manifests to a Git repo watched by ArgoCD.
  • ArgoCD deploys the Helm chart to the target cluster and keeps it reconciled.

flowchart LR
  Agent["AI Agent"] -->|MCP API: run_probe| MCP["data-mcp Server"]
  MCP -->|fetch| S3["S3 (raw objects)"]
  MCP -->|fetch| Dynamo["DynamoDB (tables)"]
  MCP -->|process| PipelineBuilder["Pipeline Builder (container/helm)"]
  PipelineBuilder -->|commit to git| GitRepo["Git repo (charts/manifests)"]
  GitRepo -->|watch & reconcile| ArgoCD["ArgoCD"]
  ArgoCD -->|deploy| K8s["Kubernetes Cluster"]
  K8s -->|run| ETL["ETL Job (pod/cron)"]
  ETL -->|write| Results["S3 / GCS destination"]
  style MCP fill:#f9f,stroke:#333,stroke-width:1px
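
A sketch of the "commit to git" step in the diagram above, assuming GitPython; the repository URL and chart contents are placeholders, and in practice this step would sit behind the policy and approval gates described earlier.

# Hypothetical helper for the "commit to git" step (GitPython). The repo URL,
# chart path, and contents are placeholders; credentials belong in a secrets
# manager, never in source control.
from pathlib import Path

from git import Repo

def publish_chart(git_url, chart_files, message, workdir="/tmp/gitops-repo"):
    repo = Repo.clone_from(git_url, workdir)
    for rel_path, content in chart_files.items():
        target = Path(workdir) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
    repo.index.add(list(chart_files))
    repo.index.commit(message)
    repo.remote(name="origin").push()          # ArgoCD picks up the new commit

publish_chart(
    "https://git.example.com/team/gitops.git",                      # placeholder
    {"charts/etl-pipeline/values.yaml": "image:\n  tag: v0.1.0\n"},
    "Add ETL pipeline chart generated by data-mcp",
)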

Example 2 — Full-stack AI app prototype with Terraform + ML forecast

  • Agent inspects a table (e.g., orders) in DynamoDB or a dataset in S3 to estimate data needs.
  • Agent generates Terraform to provision infra: EKS/GKE cluster, managed DB, object storage, IAM roles.
  • Agent generates application code scaffolding (frontend + backend + ML API) and Terraform/Terragrunt to deploy.
  • Agent builds ML model (demand forecast) and exposes it via an API in the deployed stack.
  • The MCP server can apply Terraform (with appropriate approvals) and push the application manifests to ArgoCD for deployment.

flowchart TD
  Agent["AI Agent"] -->|analyze data| MCP["data-mcp Server"]
  MCP -->|read| Dynamo["DynamoDB"]
  MCP -->|read| S3["S3"]
  Agent -->|generate terraform| TF["Terraform module"]
  TF -->|plan & apply| Infra["Cloud infrastructure (VPC, EKS/GKE, DB)"]
  Agent -->|generate app & helm| AppRepo["App repo (frontend/backend/ml-api + helm)"]
  AppRepo -->|commit| GitRepo["Git repo (charts)"]
  GitRepo -->|reconcile| ArgoCD["ArgoCD"]
  ArgoCD -->|deploy| K8s["Kubernetes Cluster"]
  K8s -->|run| App["Full-stack AI App (frontend/backend/ml-api)"]
  App -->|forecast| Drone["Drone Delivery Service (uses forecast)"]
  style TF fill:#ffefb3,stroke:#333,stroke-width:1px

Notes:

  • These examples are illustrative. Actual implementations should include security checks, policy gates, and manual approval steps where appropriate (especially for Terraform apply).
  • The repository will include templates/blueprints for common pipelines and full-stack prototypes to accelerate agent workflows.

Contact & support


This README is intended to be a guiding blueprint. Update the repository's configuration, examples, and run instructions to match the actual code layout and run scripts used in this project.