MCP dbt-Snowflake Server

An MCP server for dbt & Snowflake (jwest33/mcp_snowflake_dbt), designed to automate analytics engineering.

Available Tools

Discovery & Profiling

discover_sources

Discovers available schemas and tables in Snowflake with pattern matching capabilities.

  • Parameters: database, schema (optional pattern), like (table pattern)
  • Returns: List of tables with metadata including row counts and data size
  • Use Case: Initial exploration to understand available data assets
profile_table_tool

Provides comprehensive statistical analysis of a table including column profiles, data samples, and candidate keys.

  • Parameters: database, schema, table, sample_ratio (0.0-1.0)
  • Returns: Column statistics, null counts, uniqueness metrics, sample data
  • Caching: Results cached for 24 hours to optimize performance
  • Use Case: Deep understanding of table structure before modeling
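
The sample_ratio parameter above controls how much of the table is scanned during profiling. A minimal sketch of how a ratio could translate into a Snowflake SAMPLE clause; the helper name is hypothetical and not taken from this repo:

# Sketch only: build a sampled per-column profiling query.
def build_profile_query(database, schema, table, column, sample_ratio):
    pct = max(0.0, min(sample_ratio, 1.0)) * 100  # SAMPLE takes a percentage
    return f"""
        SELECT
            COUNT(*)                 AS row_count,
            COUNT({column})          AS non_null_count,
            COUNT(DISTINCT {column}) AS distinct_count
        FROM {database}.{schema}.{table} SAMPLE ({pct})
    """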

Key & Join Analysis

infer_keys_tool

Identifies primary key candidates using statistical analysis including uniqueness, null ratios, and naming conventions.

  • Parameters: database, schema, table, sample_ratio
  • Returns: Ranked list of candidate keys with confidence scores
  • Algorithm: Combines uniqueness metrics, naming patterns, and data type analysis (a scoring sketch follows this list)
  • Use Case: Automated primary key discovery for modeling
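
The signals in the Algorithm bullet can be combined into a single confidence score. A rough, illustrative sketch; the weights and helper name are assumptions, not this project's actual heuristics:

# Illustrative scoring only; the server's real weights and rules may differ.
def score_key_candidate(column_name, distinct_count, non_null_count, row_count):
    if row_count == 0:
        return 0.0
    uniqueness = distinct_count / row_count          # 1.0 => fully unique
    null_ratio = 1 - (non_null_count / row_count)    # keys should have few nulls
    name_bonus = 0.1 if column_name.lower().endswith(("_id", "_key")) else 0.0
    return max(0.0, min(1.0, 0.7 * uniqueness - 0.3 * null_ratio + name_bonus))
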
build_join_map_tool

Discovers relationships between tables using Jaccard similarity on column values.

  • Parameters: database, schema, like (table pattern), sample_ratio, threshold (0.0-1.0)
  • Returns: Join relationships with similarity scores and join types
  • Caching: Results cached for 7 days due to computational cost
  • Use Case: Understanding table relationships for dimensional modeling
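
Jaccard similarity treats each column as a set of (sampled) values and scores a candidate join as intersection over union. A minimal sketch with illustrative names:

# Illustrative only: compare sampled value sets from two columns.
def jaccard_similarity(left_values, right_values):
    if not left_values and not right_values:
        return 0.0
    overlap = len(left_values & right_values)
    union = len(left_values | right_values)
    return overlap / union

# A pair is reported as a join candidate when its score clears `threshold`,
# e.g. jaccard_similarity(orders_customer_ids, customers_ids) >= 0.7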

dbt Project Generation

propose_model_plan

Suggests optimal modeling strategy (star schema vs One Big Table) based on table profile and user intent.

  • Parameters: profile (from profile_table_tool), intent (analytics/reporting)
  • Returns: Recommended approach with reasoning and layer definitions
  • Use Case: Strategic planning before model generation
generate_dbt_project

Creates complete dbt project structure with profiles, configurations, and documentation.

  • Parameters: project_dir, project_name, snowflake_config
  • Creates: Full project structure including profiles.yml, dbt_project.yml, and directories
  • Use Case: Initial project setup with best practices
generate_models

Generates staging, intermediate, and mart models based on profiling and planning.

  • Parameters: project_dir, source_name, profile_json, plan_json
  • Creates: SQL models, schema.yml files, and documentation
  • Layers: Staging (type casting), Intermediate (business logic), Marts (final models)
  • Use Case: Automated model creation following dbt best practices
run_dbt_tool

Executes dbt commands with full audit trail and error handling.

  • Parameters: project_dir, command (deps/build/test/run/compile)
  • Returns: Command output, execution status, and logs
  • Audit: All executions logged to TOOL_EXECUTIONS table
  • Use Case: Running dbt workflows with monitoring

Interaction Flow

1. Discovery Phase

The LLM begins by discovering available data:

LLM → discover_sources → Snowflake
Returns available tables for analysis

2. Profiling Phase

For selected tables, the LLM runs profiling to understand their structure:

LLM → profile_table_tool → Snowflake
Returns comprehensive statistics and samples

3. Analysis Phase

The LLM identifies keys and relationships:

LLM → infer_keys_tool → Statistical Analysis
LLM → build_join_map_tool → Relationship Discovery

4. Planning Phase

Based on the analysis, the LLM proposes a modeling approach:

LLM → propose_model_plan → Strategy Recommendation
User reviews and approves approach

5. Generation Phase

The LLM creates the dbt project:

LLM → generate_dbt_project → Project Structure
LLM → generate_models → SQL Models

6. Execution Phase

The LLM runs dbt commands:

LLM → run_dbt_tool → dbt Execution
Returns results and handles errors

Configuration

Environment Variables

# Required - Snowflake Connection
SNOWFLAKE_ACCOUNT=xy12345.us-east-1
SNOWFLAKE_USER=MCP_DBT_USER
SNOWFLAKE_ROLE=MCP_DBT_ROLE
SNOWFLAKE_WAREHOUSE=MCP_DBT_WH

# Required - Target Database
SNOWFLAKE_DATABASE=SALES_DATA      # Database to analyze
SNOWFLAKE_SCHEMA=TRANSACTIONS      # Schema to analyze

# Required - MCP Metadata
MCP_DATABASE=MCP_DBT_SNOWFLAKE    # Metadata storage
MCP_METADATA_SCHEMA=METADATA       # Metadata schema

# Authentication (choose one)
SNOWFLAKE_OAUTH_TOKEN=<token>
SNOWFLAKE_PRIVATE_KEY_PATH=/path/to/key.pem
SNOWFLAKE_PASSWORD=<password>      # Not recommended

# Optional - Performance Tuning
MCP_ENABLE_CACHING=true           # Enable result caching
MCP_SAMPLING_RATIO=0.2            # Default sampling ratio
MCP_LOG_LEVEL=INFO                # Logging verbosity
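
For reference, these variables are the standard inputs for a Snowflake session. A minimal sketch of consuming them with the snowflake-connector-python package (password auth shown for brevity; this is not the server's own connection code):

import os
import snowflake.connector

# Sketch only: open a connection from the environment variables above.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],   # or OAuth / key-pair auth
    role=os.environ["SNOWFLAKE_ROLE"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
)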

Quick Setup

  1. Configure environment:

    cp .env.example .env
    # Edit with your values
    
  2. Generate Snowflake setup SQL:

    python scripts/generate_direct_sql.py
    
  3. Run in Snowflake (as ACCOUNTADMIN):

    !source scripts/init_snowflake_account_[timestamp].sql
    
  4. Initialize metadata database:

    python scripts/setup_mcp_database.py
    
  5. Run server:

    python main.py
    

Production Features

Resilience

  • Connection pooling with 1-3 connections and health checks
  • Retry logic with exponential backoff (max 30s); a generic sketch follows this list
  • Circuit breaker pattern to prevent cascading failures
  • Graceful degradation when services are unavailable
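
The retry behavior listed above follows a common pattern: exponential backoff with jitter, capped at 30 seconds. A generic sketch, not this project's actual implementation:

import random
import time

# Generic retry-with-backoff sketch; attempt counts and caps are illustrative.
def with_retries(operation, max_attempts=5, cap_seconds=30.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter, never sleeping longer than the cap.
            time.sleep(min(cap_seconds, (2 ** attempt) + random.random()))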

Performance

  • Result caching with configurable TTL (24h for profiles, 7d for joins)
  • Adaptive sampling based on table size (illustrated after this list)
  • Query optimization with pushdown predicates
  • Parallel operations for multi-table analysis
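
Adaptive sampling generally means shrinking the sample ratio as tables grow so profiling cost stays bounded. The thresholds below are assumptions for illustration, not values from this repo:

# Illustrative only: choose a sample ratio from the table's row count.
def adaptive_sample_ratio(row_count, default_ratio=0.2):
    if row_count <= 100_000:
        return 1.0            # small tables: profile every row
    if row_count <= 10_000_000:
        return default_ratio  # medium tables: use the configured default
    return 0.01               # very large tables: sample roughly 1%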

Security

  • OAuth/JWT authentication with token management
  • Role-based access with minimal privileges
  • Input validation and SQL injection prevention (see the sketch after this list)
  • Audit trail for all operations
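
Since database, schema, and table parameters end up inside generated SQL, input validation typically means allow-listing identifier characters before interpolation. A minimal sketch; the pattern and names are illustrative:

import re

# Illustrative allow-list for identifiers that get interpolated into SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

def validate_identifier(name):
    if not _IDENTIFIER.match(name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return name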

Observability

  • Structured logging with correlation IDs (sketched after this list)
  • Query performance tracking in QUERY_PERFORMANCE table
  • TruLens integration for quality metrics
  • Health endpoints for monitoring
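
Correlation IDs let one tool call be traced across log lines and the metadata tables. A generic structured-logging sketch using the standard library; field names are illustrative:

import json
import logging
import uuid

logger = logging.getLogger("mcp_snowflake_dbt")

# Sketch only: emit one JSON log line per tool event, tagged with a correlation ID.
def log_tool_event(tool_name, status, correlation_id=None):
    correlation_id = correlation_id or str(uuid.uuid4())
    logger.info(json.dumps({
        "tool": tool_name,
        "status": status,
        "correlation_id": correlation_id,
    }))
    return correlation_id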

Metadata Storage

All operations are tracked in the MCP metadata database:

MCP_DBT_SNOWFLAKE.METADATA
├── TOOL_EXECUTIONS        # Audit trail of all tool calls
├── PROFILING_CACHE        # Cached table profiles
├── JOIN_MAP_CACHE         # Cached join relationships
├── DBT_PROJECTS           # Generated project metadata
├── QUERY_PERFORMANCE      # Query execution metrics
└── TRULENS_EVALUATIONS    # Quality assessments

Monitoring Queries

-- Recent tool executions
SELECT tool_name, status, duration_ms, user_name
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
ORDER BY started_at DESC LIMIT 20;

-- Cache effectiveness
SELECT
    tool_name,
    COUNT(*) as total_calls,
    SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) as cache_hits,
    AVG(duration_ms) as avg_duration
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
GROUP BY tool_name;

-- Performance by tool
SELECT
    tool_name,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
    MAX(duration_ms) as max_duration
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
GROUP BY tool_name;

Example Workflow

# 1. Discover available tables
tables = discover_sources(
    database="SALES_DB",
    schema="PUBLIC",
    like="%order%|%customer%"
)

# 2. Profile key tables
order_profile = profile_table_tool(
    database="SALES_DB",
    schema="PUBLIC",
    table="ORDERS",
    sample_ratio=0.2
)

# 3. Identify primary keys
keys = infer_keys_tool(
    database="SALES_DB",
    schema="PUBLIC",
    table="ORDERS",
    sample_ratio=0.2
)

# 4. Find relationships
joins = build_join_map_tool(
    database="SALES_DB",
    schema="PUBLIC",
    like="%order%|%customer%",
    threshold=0.7
)

# 5. Plan modeling approach
plan = propose_model_plan(
    profile=order_profile,
    intent="analytics"
)

# 6. Generate dbt project
generate_dbt_project(
    project_dir="./sales_analytics",
    project_name="sales_analytics"
)

# 7. Create models
generate_models(
    project_dir="./sales_analytics",
    source_name="sales",
    profile_json=order_profile,
    plan_json=plan
)

# 8. Execute dbt
run_dbt_tool(project_dir="./sales_analytics", command="deps")
run_dbt_tool(project_dir="./sales_analytics", command="build")

Work In Progress

  • Deployment on Snowpark Container Services using Snowflake LLMs

License

MIT