MCP dbt-Snowflake Server
MCP server for dbt & Snowflake, designed to automate analytics engineering.
Available Tools
Discovery & Profiling
discover_sources
Discovers available schemas and tables in Snowflake with pattern matching capabilities.
- Parameters: database, schema (optional pattern), like (table pattern)
- Returns: List of tables with metadata including row counts and data size
- Use Case: Initial exploration to understand available data assets
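For illustration, a call and the kind of result it returns might look like the following; the field names in the result are assumptions, not the tool's exact payload.
# Hypothetical result shape -- field names are illustrative only
tables = discover_sources(database="SALES_DB", schema="PUBLIC", like="%ORDER%")
# tables might resemble:
# [{"schema": "PUBLIC", "table": "ORDERS", "row_count": 1200000, "bytes": 52428800}, ...]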
profile_table_tool
Provides comprehensive statistical analysis of a table including column profiles, data samples, and candidate keys.
- Parameters: database, schema, table, sample_ratio (0.0-1.0)
- Returns: Column statistics, null counts, uniqueness metrics, sample data
- Caching: Results cached for 24 hours to optimize performance
- Use Case: Deep understanding of table structure before modeling
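For illustration, a returned profile might take a shape like the one below; the keys are assumptions derived from the fields listed above, not the server's exact schema.
# Hypothetical profile shape -- keys are illustrative only
profile = profile_table_tool(database="SALES_DB", schema="PUBLIC", table="ORDERS", sample_ratio=0.2)
# profile might resemble:
# {
#     "columns": [{"name": "ORDER_ID", "dtype": "NUMBER", "null_count": 0, "distinct_ratio": 1.0}, ...],
#     "sample_rows": [...],
#     "candidate_keys": ["ORDER_ID"],
# }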
Key & Join Analysis
infer_keys_tool
Identifies primary key candidates using statistical analysis including uniqueness, null ratios, and naming conventions.
- Parameters: database, schema, table, sample_ratio
- Returns: Ranked list of candidate keys with confidence scores
- Algorithm: Combines uniqueness metrics, naming patterns, and data type analysis
- Use Case: Automated primary key discovery for modeling
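As a hedged sketch of the idea (not the server's actual code), uniqueness, null ratio, and naming conventions could be folded into a single confidence score roughly like this:
# Minimal, illustrative scoring sketch -- weights and naming rules are assumptions
def score_key_candidate(name, distinct_count, null_count, row_count):
    uniqueness = distinct_count / row_count if row_count else 0.0
    null_ratio = null_count / row_count if row_count else 1.0
    naming_bonus = 0.1 if name.lower() == "id" or name.lower().endswith(("_id", "_key")) else 0.0
    return uniqueness * (1.0 - null_ratio) + naming_bonus

score_key_candidate("ORDER_ID", distinct_count=1000000, null_count=0, row_count=1000000)  # 1.1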
build_join_map_tool
Discovers relationships between tables using Jaccard similarity on column values.
- Parameters: database, schema, like (table pattern), sample_ratio, threshold (0.0-1.0)
- Returns: Join relationships with similarity scores and join types
- Caching: Results cached for 7 days due to computational cost
- Use Case: Understanding table relationships for dimensional modeling
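The Jaccard calculation itself is standard; a minimal sketch over two sampled value sets (illustrative, not the server's code) looks like this:
# Jaccard similarity of two columns' sampled values: |intersection| / |union|
def jaccard(left_values, right_values):
    left, right = set(left_values), set(right_values)
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)

# Columns that share most of their values (e.g. ORDERS.CUSTOMER_ID vs CUSTOMERS.CUSTOMER_ID)
# score near 1.0 and are reported as join candidates when above the threshold parameter.
jaccard([1, 2, 3, 4], [2, 3, 4, 5])  # 0.6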
dbt Project Generation
propose_model_plan
Suggests optimal modeling strategy (star schema vs One Big Table) based on table profile and user intent.
- Parameters: profile (from profile_table_tool), intent (analytics/reporting)
- Returns: Recommended approach with reasoning and layer definitions
- Use Case: Strategic planning before model generation
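For illustration, a returned plan might resemble the structure below; the key names are assumptions, while the approaches and layers mirror those described elsewhere in this README.
# Hypothetical plan shape -- key names are illustrative only
plan = {
    "approach": "star_schema",  # or "one_big_table"
    "reasoning": "Clear fact grain with several reusable dimensions",
    "layers": {
        "staging": ["stg_orders"],
        "intermediate": ["int_orders_enriched"],
        "marts": ["fct_orders", "dim_customers"],
    },
}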
generate_dbt_project
Creates complete dbt project structure with profiles, configurations, and documentation.
- Parameters: project_dir, project_name, snowflake_config
- Creates: Full project structure including profiles.yml, dbt_project.yml, and directories
- Use Case: Initial project setup with best practices
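For illustration, snowflake_config could be a mapping that mirrors the environment variables shown later in this README; the exact keys the tool expects are an assumption here.
# Hypothetical snowflake_config -- key names mirror the environment variables and are illustrative only
snowflake_config = {
    "account": "xy12345.us-east-1",
    "user": "MCP_DBT_USER",
    "role": "MCP_DBT_ROLE",
    "warehouse": "MCP_DBT_WH",
    "database": "SALES_DATA",
    "schema": "TRANSACTIONS",
}
generate_dbt_project(
    project_dir="./sales_analytics",
    project_name="sales_analytics",
    snowflake_config=snowflake_config,
)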
generate_models
Generates staging, intermediate, and mart models based on profiling and planning.
- Parameters: project_dir, source_name, profile_json, plan_json
- Creates: SQL models, schema.yml files, and documentation
- Layers: Staging (type casting), Intermediate (business logic), Marts (final models)
- Use Case: Automated model creation following dbt best practices
run_dbt_tool
Executes dbt commands with full audit trail and error handling.
- Parameters: project_dir, command (deps/build/test/run/compile)
- Returns: Command output, execution status, and logs
- Audit: All executions logged to TOOL_EXECUTIONS table
- Use Case: Running dbt workflows with monitoring
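For illustration, a result might carry fields similar to the TOOL_EXECUTIONS columns referenced below (status, duration); the exact return shape is an assumption.
# Hypothetical result shape -- field names are illustrative only
result = run_dbt_tool(project_dir="./sales_analytics", command="build")
# result might resemble:
# {"status": "success", "command": "build", "duration_ms": 48210, "logs": "..."}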
Interaction Flow
1. Discovery Phase
The LLM begins by discovering available data:
LLM → discover_sources → Snowflake
Returns available tables for analysis
2. Profiling Phase
For selected tables, the LLM profiles to understand structure:
LLM → profile_table_tool → Snowflake
Returns comprehensive statistics and samples
3. Analysis Phase
The LLM identifies keys and relationships:
LLM → infer_keys_tool → Statistical Analysis
LLM → build_join_map_tool → Relationship Discovery
4. Planning Phase
Based on analysis, the LLM proposes modeling approach:
LLM → propose_model_plan → Strategy Recommendation
User reviews and approves approach
5. Generation Phase
The LLM creates the dbt project:
LLM → generate_dbt_project → Project Structure
LLM → generate_models → SQL Models
6. Execution Phase
The LLM runs dbt commands:
LLM → run_dbt_tool → dbt Execution
Returns results and handles errors
Configuration
Environment Variables
# Required - Snowflake Connection
SNOWFLAKE_ACCOUNT=xy12345.us-east-1
SNOWFLAKE_USER=MCP_DBT_USER
SNOWFLAKE_ROLE=MCP_DBT_ROLE
SNOWFLAKE_WAREHOUSE=MCP_DBT_WH
# Required - Target Database
SNOWFLAKE_DATABASE=SALES_DATA # Database to analyze
SNOWFLAKE_SCHEMA=TRANSACTIONS # Schema to analyze
# Required - MCP Metadata
MCP_DATABASE=MCP_DBT_SNOWFLAKE # Metadata storage
MCP_METADATA_SCHEMA=METADATA # Metadata schema
# Authentication (choose one)
SNOWFLAKE_OAUTH_TOKEN=<token>
SNOWFLAKE_PRIVATE_KEY_PATH=/path/to/key.pem
SNOWFLAKE_PASSWORD=<password> # Not recommended
# Optional - Performance Tuning
MCP_ENABLE_CACHING=true # Enable result caching
MCP_SAMPLING_RATIO=0.2 # Default sampling ratio
MCP_LOG_LEVEL=INFO # Logging verbosity
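As a hedged sketch (not code from this repository), these variables could map onto a snowflake-connector-python connection roughly as follows; the server's own connection handling, pooling, and key-pair authentication may differ.
# Illustrative mapping of the environment variables to a Snowflake connection
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    role=os.environ["SNOWFLAKE_ROLE"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
    authenticator="oauth",                       # one of the authentication options above
    token=os.environ["SNOWFLAKE_OAUTH_TOKEN"],
)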
Quick Setup
1. Configure environment:
cp .env.example .env  # Edit with your values
2. Generate Snowflake setup SQL:
python scripts/generate_direct_sql.py
3. Run in Snowflake (as ACCOUNTADMIN):
!source scripts/init_snowflake_account_[timestamp].sql
4. Initialize metadata database:
python scripts/setup_mcp_database.py
5. Run server:
python main.py
Production Features
Resilience
- Connection pooling with 1-3 connections and health checks
- Retry logic with exponential backoff (max 30s); see the sketch after this list
- Circuit breaker pattern to prevent cascading failures
- Graceful degradation when services unavailable
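A minimal sketch of retry with exponential backoff capped at 30 seconds (illustrative, not the server's implementation):
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    # Retry fn, doubling the delay each attempt but never sleeping longer than max_delay
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(base_delay * (2 ** attempt), max_delay))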
Performance
- Result caching with configurable TTL (24h for profiles, 7d for joins)
- Adaptive sampling based on table size (see the sketch after this list)
- Query optimization with pushdown predicates
- Parallel operations for multi-table analysis
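A minimal sketch of adaptive sampling, with thresholds chosen purely for illustration:
# Illustrative only -- the row-count thresholds are assumptions, not the server's actual cutoffs
def adaptive_sample_ratio(row_count, default_ratio=0.2):
    if row_count <= 100000:
        return 1.0            # small tables: scan everything
    if row_count <= 10000000:
        return default_ratio  # medium tables: use the configured default
    return max(0.01, default_ratio / 10)  # very large tables: shrink the sample further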
Security
- OAuth/JWT authentication with token management
- Role-based access with minimal privileges
- Input validation and SQL injection prevention
- Audit trail for all operations
Observability
- Structured logging with correlation IDs
- Query performance tracking in QUERY_PERFORMANCE table
- TruLens integration for quality metrics
- Health endpoints for monitoring
Metadata Storage
All operations are tracked in the MCP metadata database:
MCP_DBT_SNOWFLAKE.METADATA
├── TOOL_EXECUTIONS       # Audit trail of all tool calls
├── PROFILING_CACHE       # Cached table profiles
├── JOIN_MAP_CACHE        # Cached join relationships
├── DBT_PROJECTS          # Generated project metadata
├── QUERY_PERFORMANCE     # Query execution metrics
└── TRULENS_EVALUATIONS   # Quality assessments
Monitoring Queries
-- Recent tool executions
SELECT tool_name, status, duration_ms, user_name
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
ORDER BY started_at DESC LIMIT 20;
-- Cache effectiveness
SELECT
tool_name,
COUNT(*) as total_calls,
SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) as cache_hits,
AVG(duration_ms) as avg_duration
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
GROUP BY tool_name;
-- Performance by tool
SELECT
tool_name,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
MAX(duration_ms) as max_duration
FROM MCP_DBT_SNOWFLAKE.METADATA.TOOL_EXECUTIONS
GROUP BY tool_name;
Example Workflow
# 1. Discover available tables
tables = discover_sources(
database="SALES_DB",
schema="PUBLIC",
like="%order%|%customer%"
)
# 2. Profile key tables
order_profile = profile_table_tool(
database="SALES_DB",
schema="PUBLIC",
table="ORDERS",
sample_ratio=0.2
)
# 3. Identify primary keys
keys = infer_keys_tool(
database="SALES_DB",
schema="PUBLIC",
table="ORDERS",
sample_ratio=0.2
)
# 4. Find relationships
joins = build_join_map_tool(
database="SALES_DB",
schema="PUBLIC",
like="%order%|%customer%",
threshold=0.7
)
# 5. Plan modeling approach
plan = propose_model_plan(
profile=order_profile,
intent="analytics"
)
# 6. Generate dbt project
generate_dbt_project(
project_dir="./sales_analytics",
project_name="sales_analytics"
)
# 7. Create models
generate_models(
project_dir="./sales_analytics",
source_name="sales",
profile_json=order_profile,
plan_json=plan
)
# 8. Execute dbt
run_dbt_tool(project_dir="./sales_analytics", command="deps")
run_dbt_tool(project_dir="./sales_analytics", command="build")
Work In Progress
- Deployment on Snowpark Container Services using Snowflake LLMs
License
MIT