MCP dbt-Snowflake Server

An MCP server for dbt & Snowflake (jwest33/mcp_snowflake_dbt), designed to automate analytics engineering.

Available Tools

Discovery & Profiling

discover_sources

Discovers available schemas and tables in Snowflake with pattern matching capabilities.

  • Parameters: database, schema (optional pattern), like (table pattern)
  • Returns: List of tables with metadata including row counts and data size
  • Use Case: Initial exploration to understand available data assets
profile_table_tool

Provides comprehensive statistical analysis of a table including column profiles, data samples, and candidate keys.

  • Parameters: database, schema, table, sample_ratio (0.0-1.0)
  • Returns: Column statistics, null counts, uniqueness metrics, sample data
  • Caching: Results cached for 24 hours to optimize performance
  • Use Case: Deep understanding of table structure before modeling
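
The sample_ratio parameter above controls how much of the table is scanned during profiling. A minimal sketch of how a ratio could translate into a Snowflake SAMPLE clause; the helper name is hypothetical and not taken from this repo:

# Sketch only: build a sampled per-column profiling query.
def build_profile_query(database, schema, table, column, sample_ratio):
    pct = max(0.0, min(sample_ratio, 1.0)) * 100  # SAMPLE takes a percentage
    return f"""
        SELECT
            COUNT(*)                 AS row_count,
            COUNT({column})          AS non_null_count,
            COUNT(DISTINCT {column}) AS distinct_count
        FROM {database}.{schema}.{table} SAMPLE ({pct})
    """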

Key & Join Analysis

infer_keys_tool

Identifies primary key candidates using statistical analysis including uniqueness, null ratios, and naming conventions.

  • Parameters: database, schema, table, sample_ratio
  • Returns: Ranked list of candidate keys with confidence scores
  • Algorithm: Combines uniqueness metrics, naming patterns, and data type analysis (a scoring sketch follows this list)
  • Use Case: Automated primary key discovery for modeling
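
The signals in the Algorithm bullet can be combined into a single confidence score. A rough, illustrative sketch; the weights and helper name are assumptions, not this project's actual heuristics:

# Illustrative scoring only; the server's real weights and rules may differ.
def score_key_candidate(column_name, distinct_count, non_null_count, row_count):
    if row_count == 0:
        return 0.0
    uniqueness = distinct_count / row_count          # 1.0 => fully unique
    null_ratio = 1 - (non_null_count / row_count)    # keys should have few nulls
    name_bonus = 0.1 if column_name.lower().endswith(("_id", "_key")) else 0.0
    return max(0.0, min(1.0, 0.7 * uniqueness - 0.3 * null_ratio + name_bonus))
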
build_join_map_tool

Discovers relationships between tables using Jaccard similarity on column values.

  • Parameters: database, schema, like (table pattern), sample_ratio, threshold (0.0-1.0)
  • Returns: Join relationships with similarity scores and join types
  • Caching: Results cached for 7 days due to computational cost
  • Use Case: Understanding table relationships for dimensional modeling
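
Jaccard similarity treats each column as a set of (sampled) values and scores a candidate join as intersection over union. A minimal sketch with illustrative names:

# Illustrative only: compare sampled value sets from two columns.
def jaccard_similarity(left_values, right_values):
    if not left_values and not right_values:
        return 0.0
    overlap = len(left_values & right_values)
    union = len(left_values | right_values)
    return overlap / union

# A pair is reported as a join candidate when its score clears `threshold`,
# e.g. jaccard_similarity(orders_customer_ids, customers_ids) >= 0.7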

dbt Project Generation

propose_model_plan

Suggests optimal modeling strategy (star schema vs One Big Table) based on table profile and user intent.

  • Parameters: profile (from profile_table_tool), intent (analytics/reporting)
  • Returns: Recommended approach with reasoning and layer definitions
  • Use Case: Strategic planning before model generation
generate_dbt_project

Creates complete dbt project structure with profiles, configurations, and documentation.

  • Parameters: project_dir, project_name, snowflake_config
  • Creates: Full project structure including profiles.yml, dbt_project.yml, and directories
  • Use Case: Initial project setup with best practices
generate_models

Generates staging, intermediate, and mart models based on profiling and planning.

  • Parameters: project_dir, source_name, profile_json, plan_json
  • Creates: SQL models, schema.yml files, and documentation
  • Layers: Staging (type casting), Intermediate (business logic), Marts (final models)
  • Use Case: Automated model creation following dbt best practices
run_dbt_tool

Executes dbt commands with full audit trail and error handling.

  • Parameters: project_dir, command (deps/build/test/run/compile)
  • Returns: Command output, execution status, and logs
  • Audit: All executions logged to TOOL_EXECUTIONS table
  • Use Case: Running dbt workflows with monitoring

Interaction Flow

1. Discovery Phase

The LLM begins by discovering available data:

LLM → discover_sources → Snowflake
Returns available tables for analysis

2. Profiling Phase

For selected tables, the LLM runs profiling to understand their structure:

LLM → profile_table_tool → Snowflake
Returns comprehensive statistics and samples

3. Analysis Phase

The LLM identifies keys and relationships:

LLM → infer_keys_tool → Statistical Analysis
LLM → build_join_map_tool → Relationship Discovery

4. Planning Phase

Based on the analysis, the LLM proposes a modeling approach:

LLM → propose_model_plan → Strategy Recommendation
User reviews and approves approach

5. Generation Phase

The LLM creates the dbt project:

LLM → generate_dbt_project → Project Structure
LLM → generate_models → SQL Models

6. Execution Phase

The LLM runs dbt commands:

LLM → run_dbt_tool → dbt Execution
Returns results and handles errors

Configuration

Environment Variables

# Required - Snowflake Connection
SNOWFLAKE_ACCOUNT=xy12345.us-east-1
SNOWFLAKE_USER=MCP_DBT_USER
SNOWFLAKE_ROLE=MCP_DBT_ROLE
SNOWFLAKE_WAREHOUSE=MCP_DBT_WH

# Required - Target Database
SNOWFLAKE_DATABASE=SALES_DATA      # Database to analyze
SNOWFLAKE_SCHEMA=TRANSACTIONS      # Schema to analyze

# Required - MCP Metadata
MCP_DATABASE=MCP_DBT_SNOWFLAKE    # Metadata storage
MCP_METADATA_SCHEMA=METADATA       # Metadata schema

# Authentication (choose one)
SNOWFLAKE_OAUTH_TOKEN=<token>
SNOWFLAKE_PRIVATE_KEY_PATH=/path/to/key.pem
SNOWFLAKE_PASSWORD=<password>      # Not recommended

# Optional - Performance Tuning
MCP_ENABLE_CACHING=true           # Enable result caching
MCP_SAMPLING_RATIO=0.2            # Default sampling ratio
MCP_LOG_LEVEL=INFO                # Logging verbosity
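
For reference, these variables are the standard inputs for a Snowflake session. A minimal sketch of consuming them with the snowflake-connector-python package (password auth shown for brevity; this is not the server's own connection code):

import os
import snowflake.connector

# Sketch only: open a connection from the environment variables above.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],   # or OAuth / key-pair auth
    role=os.environ["SNOWFLAKE_ROLE"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
)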

Quick Setup

  1. Configure environment:

    cp .env.example .env
    # Edit with your values
    
  2. Generate Snowflake setup SQL:

    python scripts/generate_direct_sql.py
    
  3. Run in Snowflake (as ACCOUNTADMIN):

    !source scripts/init_snowflake_account_[timestamp].sql
    
  4. Initialize metadata database:

    python scripts/setup_mcp_database.py
    
  5. Run server:

    python main.py
    

Production Features

Resilience

  • Connection pooling with 1-3 connections and health checks
  • Retry logic with exponential backoff (max 30s); a generic sketch follows this list
  • Circuit breaker pattern to prevent cascading failures
  • Graceful degradation when services are unavailable
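
The retry behavior listed above follows a common pattern: exponential backoff with jitter, capped at 30 seconds. A generic sketch, not this project's actual implementation:

import random
import time

# Generic retry-with-backoff sketch; attempt counts and caps are illustrative.
def with_retries(operation, max_attempts=5, cap_seconds=30.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter, never sleeping longer than the cap.
            time.sleep(min(cap_seconds, (2 ** attempt) + random.random()))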

Performance

  • Result caching with configurable TTL (24h for profiles, 7d for joins)
  • Adaptive sampling based on table size (illustrated after this list)
  • Query optimization with pushdown predicates
  • Parallel operations for multi-table analysis
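
Adaptive sampling generally means shrinking the sample ratio as tables grow so profiling cost stays bounded. The thresholds below are assumptions for illustration, not values from this repo:

# Illustrative only: choose a sample ratio from the table's row count.
def adaptive_sample_ratio(row_count, default_ratio=0.2):
    if row_count <= 100_000:
        return 1.0            # small tables: profile every row
    if row_count <= 10_000_000:
        return default_ratio  # medium tables: use the configured default
    return 0.01               # very large tables: sample roughly 1%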

Security

  • OAuth/JWT authentication with token management
  • Role-based access with minimal privileges
  • Input validation and SQL injection prevention (see the sketch after this list)
  • Audit trail for all operations
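
Since database, schema, and table parameters end up inside generated SQL, input validation typically means allow-listing identifier characters before interpolation. A minimal sketch; the pattern and names are illustrative:

import re

# Illustrative allow-list for identifiers that get interpolated into SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

def validate_identifier(name):
    if not _IDENTIFIER.match(name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return name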

Observability

  • Structured logging with correlation IDs (sketched after this list)
  • Query performance tracking in QUERY_PERFORMANCE table
  • TruLens integration for quality metrics
  • Health endpoints for monitoring
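
Correlation IDs let one tool call be traced across log lines and the metadata tables. A generic structured-logging sketch using the standard library; field names are illustrative:

import json
import logging
import uuid

logger = logging.getLogger("mcp_snowflake_dbt")

# Sketch only: emit one JSON log line per tool event, tagged with a correlation ID.
def log_tool_event(tool_name, status, correlation_id=None):
    correlation_id = correlation_id or str(uuid.uuid4())
    logger.info(json.dumps({
        "tool": tool_name,
        "status": status,
        "correlation_id": correlation_id,
    }))
    return correlation_id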

Metadata Storage

All operations are tracked in the MCP metadata database:

MCP_DBT_SNOWFLAKE.METADATA
├── TOOL_EXECUTIONS        # Audit trail of all tool calls
├── PROFILING_CACHE        # Cached table profiles
├── JOIN_MAP_CACHE         # Cached join relationships
├── DBT_PROJECTS           # Generated project metadata
├── QUERY_PERFORMANCE      # Query execution metrics
└── TRULENS_EVALUATIONS    # Quality assessments

Monitoring Queries

-- Recent tool executions
SELECT tool_name, status, duration_ms, user_name
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
ORDER BY started_at DESC LIMIT 20;

-- Cache effectiveness
SELECT
    tool_name,
    COUNT(*) as total_calls,
    SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) as cache_hits,
    AVG(duration_ms) as avg_duration
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
GROUP BY tool_name;

-- Performance by tool
SELECT
    tool_name,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
    MAX(duration_ms) as max_duration
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
GROUP BY tool_name;

Example Workflow

# 1. Discover available tables
tables = discover_sources(
    database="SALES_DB",
    schema="PUBLIC",
    like="%order%|%customer%"
)

# 2. Profile key tables
order_profile = profile_table_tool(
    database="SALES_DB",
    schema="PUBLIC",
    table="ORDERS",
    sample_ratio=0.2
)

# 3. Identify primary keys
keys = infer_keys_tool(
    database="SALES_DB",
    schema="PUBLIC",
    table="ORDERS",
    sample_ratio=0.2
)

# 4. Find relationships
joins = build_join_map_tool(
    database="SALES_DB",
    schema="PUBLIC",
    like="%order%|%customer%",
    threshold=0.7
)

# 5. Plan modeling approach
plan = propose_model_plan(
    profile=order_profile,
    intent="analytics"
)

# 6. Generate dbt project
generate_dbt_project(
    project_dir="./sales_analytics",
    project_name="sales_analytics"
)

# 7. Create models
generate_models(
    project_dir="./sales_analytics",
    source_name="sales",
    profile_json=order_profile,
    plan_json=plan
)

# 8. Execute dbt
run_dbt_tool(project_dir="./sales_analytics", command="deps")
run_dbt_tool(project_dir="./sales_analytics", command="build")

Work In Progress

  • Deployment on Snowpark Container Services using Snowflake LLMs

License

MIT