Apache Iceberg MCP Server
A comprehensive Model Context Protocol (MCP) server for Apache Iceberg that enables LLMs like Claude to efficiently work with Iceberg data lakes. This server provides essential data exploration, quality checks, metadata operations, and performance optimization tools.
Features
Exploration and Metadata (6 tools)
- list_tables: List all tables in namespaces
- get_schema: Get detailed schema information with types and metadata
- get_partitioning: View partition specifications and transforms
- get_table_properties: Access table configuration and properties
- get_snapshots: List available snapshots for time travel
- get_table_stats: Get basic statistics (row count, file count, size)
Data Quality (6 tools)
- sample_data: Get sample rows with optional random sampling
- get_null_counts: Count null values per column with percentages
- get_distinct_counts: Count distinct values per column
- get_value_distribution: Get top N most frequent values with counts
- check_duplicates: Detect duplicate rows based on specified columns
- get_column_stats: Statistical summary for numeric columns (min, max, avg, std)
Content Analysis (4 tools)
- preview_partitions: Show existing partitions and their sizes
- search_values: Search for rows containing specific values
- get_data_types_summary: Get data type distribution
- validate_schema_evolution: Show schema evolution history
Performance Optimization (3 tools)
- get_file_stats: Get information about data files
- analyze_skew: Detect partition imbalance
- get_table_metadata: Get complete metadata for optimization
Utilities (5 tools)
- execute_query: Execute queries using pandas query syntax
- get_column_names: Get simple list of column names
- check_table_exists: Verify if a table exists
- get_latest_snapshot: Get most recent snapshot details
- filter_preview: Preview data with filters applied
Total: 24 tools
Requirements
- Python 3.12+
- PyIceberg 0.10.0+
- AWS credentials (for Glue catalog)
- Access to Iceberg catalog (Glue, REST, or Hive)
Installation
Using uv (recommended)
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository and enter it
git clone https://github.com/jaimeferj/mcp-iceberg.git mcpIceberg
cd mcpIceberg
# Install dependencies
uv sync
Using pip
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Configuration
Centralized Database Connection
This project uses a single, centralized database/catalog configuration that is shared across all tools and operations. You only need to configure ONE database connection, and all 24 MCP tools will automatically use it.
Key Features:
- ✅ Single point of configuration - no duplicate connection strings
- ✅ Consistent connections across all tools and scripts
- ✅ Easy to switch between environments (dev/staging/prod)
- ✅ Centralized error handling and connection validation
- ✅ Support for multiple catalog types (Glue, REST, Hive)
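For intuition, the shared connection can be pictured as a cached loader around PyIceberg's load_catalog(). This is a minimal sketch only; the helper name get_catalog() and the exact dispatch logic are illustrative, not the project's confirmed internals:

import os
from functools import lru_cache

from pyiceberg.catalog import Catalog, load_catalog


@lru_cache(maxsize=1)
def get_catalog() -> Catalog:
    """Build the catalog once; every tool call reuses the same instance."""
    catalog_type = os.environ.get("ICEBERG_CATALOG_TYPE", "glue")
    props = {"type": catalog_type}
    if catalog_type == "rest":
        props["uri"] = os.environ["ICEBERG_REST_URI"]
    elif catalog_type == "hive":
        props["uri"] = os.environ["ICEBERG_HIVE_URI"]
    return load_catalog("default", **props)

Caching the catalog in one place is what makes the "single point of configuration" property hold: switching environments means changing environment variables, not code.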
Environment Variables
The MCP server supports multiple Iceberg catalog types. Configure using environment variables.
Quick Start: Copy the template file and edit with your values:
cp .env.sample .env
# Edit .env with your actual credentials
Test Your Connection:
python test_connection.py
# or
uv run test_connection.py
See .env.sample for a comprehensive list of all available configuration options with detailed explanations.
Database Scope Restriction (Security Feature)
You can restrict all operations to a single database for enhanced security:
# .env
ICEBERG_DATABASE=production
When enabled:
- ✅ All operations restricted to specified database only
- ✅ Prevents accidental access to other databases
- ✅ Use simple table names: users instead of production.users
- ✅ Cross-database queries automatically blocked
Example:
# With ICEBERG_DATABASE=production
table = load_table("users") # ✓ Works - uses production.users
table = load_table("staging.users") # ✗ Blocked - DatabaseScopeError
AWS Glue Catalog (default)
export ICEBERG_CATALOG_TYPE=glue
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
REST Catalog
export ICEBERG_CATALOG_TYPE=rest
export ICEBERG_REST_URI=http://localhost:8181
Hive Metastore
export ICEBERG_CATALOG_TYPE=hive
export ICEBERG_HIVE_URI=thrift://localhost:9083
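A quick connectivity smoke test, in the spirit of test_connection.py (shown here for the REST catalog; substitute the corresponding properties for Glue or Hive):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", type="rest", uri="http://localhost:8181")

# Listing namespaces is a cheap round trip that proves the catalog is reachable.
for namespace in catalog.list_namespaces():
    print(".".join(namespace))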
Claude Desktop Configuration
Add to your Claude Desktop configuration file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "iceberg": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/mcpIceberg",
        "run",
        "main.py"
      ],
      "env": {
        "ICEBERG_CATALOG_TYPE": "glue",
        "AWS_REGION": "us-east-1",
        "AWS_ACCESS_KEY_ID": "your_access_key",
        "AWS_SECRET_ACCESS_KEY": "your_secret_key"
      }
    }
  }
}
Usage Examples
Basic Exploration
List all available tables:
> Use list_tables
Get schema for a specific table:
> Use get_schema with table_name: "my_database.my_table"
Get basic statistics:
> Use get_table_stats with table_name: "my_database.my_table"
Data Quality Checks
Sample data from a table:
> Use sample_data with table_name: "my_database.my_table", n: 20
Check for null values:
> Use get_null_counts with table_name: "my_database.my_table"
Find duplicates:
> Use check_duplicates with table_name: "my_database.my_table", columns: ["id", "email"]
Get statistical summary:
> Use get_column_stats with table_name: "my_database.my_table"
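Under the hood, checks like these typically materialize a bounded scan into pandas. A rough sketch of the null-count computation, assuming PyIceberg and pandas and a catalog configured as above (the 100,000-row cap is illustrative):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("my_database.my_table")

# Cap the scan so a quality check never pulls an entire large table.
df = table.scan(limit=100_000).to_pandas()
null_counts = df.isna().sum()
null_pct = (null_counts / len(df) * 100).round(2)
print(null_pct.to_dict())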
Advanced Analysis
Get value distribution for a column:
> Use get_value_distribution with table_name: "my_database.my_table", column: "status", top_n: 20
Search for specific values:
> Use search_values with table_name: "my_database.my_table", column: "user_id", value: "12345"
Execute custom queries:
> Use execute_query with table_name: "my_database.my_table", query: "age > 30 and status == 'active'"
Preview with filters:
> Use filter_preview with table_name: "my_database.my_table", filters: "amount > 1000", limit: 50
Performance Analysis
Check partition skew:
> Use analyze_skew with table_name: "my_database.my_table"
Get file statistics:
> Use get_file_stats with table_name: "my_database.my_table"
View partition details:
> Use preview_partitions with table_name: "my_database.my_table"
Metadata Operations
View snapshots for time travel:
> Use get_snapshots with table_name: "my_database.my_table", limit: 20
Check schema evolution:
> Use validate_schema_evolution with table_name: "my_database.my_table"
Get table properties:
> Use get_table_properties with table_name: "my_database.my_table"
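Equivalently, snapshot history can be read straight off the table metadata with PyIceberg; a sketch, assuming a catalog configured as above:

from datetime import datetime, timezone

from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("my_database.my_table")

# Each snapshot is a point-in-time version of the table usable for time travel.
for snap in table.snapshots():
    ts = datetime.fromtimestamp(snap.timestamp_ms / 1000, tz=timezone.utc)
    print(snap.snapshot_id, ts.isoformat(), snap.summary)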
Table Name Format
All tools that accept a table_name parameter require a fully qualified name in the format:
namespace.table_name
(unless ICEBERG_DATABASE is set, in which case unqualified names resolve within that database).
Examples:
- production.users
- analytics.daily_metrics
- staging.orders
Response Format
All tools return responses in JSON format. Example response from get_schema:
{
  "table": "my_database.my_table",
  "schema_id": 1,
  "column_count": 5,
  "columns": [
    {
      "name": "id",
      "type": "long",
      "nullable": false,
      "id": 1
    },
    {
      "name": "name",
      "type": "string",
      "nullable": true,
      "id": 2
    }
  ]
}
Error Handling
The server handles errors gracefully and returns informative error messages:
- Table not found: Verify the table name is fully qualified
- Connection errors: Check your catalog configuration and credentials
- Permission errors: Ensure your AWS credentials have appropriate permissions
- Query errors: Verify your query syntax (pandas query format for filters)
Query Syntax
The execute_query and filter_preview tools use pandas query syntax:
Examples:
# Numeric comparisons
"age > 30"
"amount >= 1000 and amount <= 5000"
# String comparisons
"status == 'active'"
"country in ['US', 'UK', 'CA']"
# Boolean operations
"is_verified == True and age > 18"
# Complex conditions
"(age > 30 or vip == True) and status == 'active'"
Performance Considerations
- Sampling: For large tables, use sample_data with small n values to avoid loading entire tables
- Column selection: When possible, specify columns in tools like get_null_counts to reduce data transfer
- Limits: Most tools have sensible default limits to prevent loading excessive data
- Random sampling: loads 10x the requested rows for better randomness (see the sketch below)
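The 10x over-fetch could look roughly like this; a sketch under the assumptions above, not the server's exact implementation:

import pandas as pd
from pyiceberg.catalog import load_catalog


def random_sample(table_name: str, n: int) -> pd.DataFrame:
    table = load_catalog("default").load_table(table_name)
    # Over-fetch 10x the requested rows, then sample locally: cheaper than a
    # full scan, more random than simply taking the first n rows.
    df = table.scan(limit=n * 10).to_pandas()
    return df.sample(n=min(n, len(df)))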
Logging
Logs are written to stderr and include:
- Tool invocations with parameters
- Catalog initialization status
- Error details with stack traces
View logs in the Claude Desktop developer console or check the debug output when running manually.
Development
Running Manually
# With environment variables
export ICEBERG_CATALOG_TYPE=glue
export AWS_REGION=us-east-1
uv run main.py
Testing
# Run tests
uv run pytest tests/
# Run with coverage
uv run pytest --cov=main tests/
Adding New Tools
- Add the async method to the IcebergMCPServer class
- Add the tool definition to the get_tools() method (see the definition sketch after the example below)
- The handler will automatically route calls to your method
- Return a dictionary that will be JSON-serialized
Example:
async def my_new_tool(self, table_name: str, param: str) -> Dict[str, Any]:
    """Tool description"""
    try:
        table = self._load_table(table_name)
        # Your logic here
        return {
            "table": table_name,
            "result": "value"
        }
    except Exception as e:
        logger.error(f"Error in my_new_tool: {e}")
        raise
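For step 2, MCP tool definitions carry a name, a description, and a JSON Schema under inputSchema. A hypothetical entry for the example above (field contents are illustrative):

{
    "name": "my_new_tool",
    "description": "One-line description the LLM sees when choosing tools",
    "inputSchema": {
        "type": "object",
        "properties": {
            "table_name": {"type": "string", "description": "namespace.table_name"},
            "param": {"type": "string"},
        },
        "required": ["table_name", "param"],
    },
}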
Troubleshooting
MCP server not showing in Claude Desktop
- Check the config file path and JSON syntax
- Restart Claude Desktop completely
- Check Claude Desktop logs for errors
AWS Glue connection issues
- Verify AWS credentials are valid
- Check IAM permissions (need Glue and S3 access)
- Verify the AWS region is correct
- Test credentials with the AWS CLI: aws glue get-databases
Table not found errors
- Verify the table name format: namespace.table_name
- Check that the table exists: use list_tables first
- Verify permissions to access the namespace
Performance issues
- Use smaller sample sizes
- Specify columns when possible to reduce data transfer
- Use partition filters when available
- Consider using the approximate parameter for distinct counts
Architecture
Claude Desktop/CLI
        ↓
MCP Protocol (stdio)
        ↓
IcebergMCPServer
        ↓
PyIceberg Library
        ↓
Iceberg Catalog (Glue/REST/Hive)
        ↓
Iceberg Tables (S3/HDFS/etc.)
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
This project is open source and available under the MIT License.
Support
For issues and questions:
- GitHub Issues: Create an issue
- Documentation: This README
- PyIceberg Docs: https://py.iceberg.apache.org/
Changelog
v0.1.0 (Initial Release)
- Implemented 24 tools across 5 categories
- Support for Glue, REST, and Hive catalogs
- Comprehensive error handling
- JSON response format
- Full table metadata and schema operations
- Data quality checks and sampling
- Performance optimization tools
- Content analysis and search capabilities
Acknowledgments
- Built with PyIceberg
- Uses the Model Context Protocol