Apache Iceberg MCP Server
A comprehensive Model Context Protocol (MCP) server for Apache Iceberg that enables LLMs like Claude to efficiently work with Iceberg data lakes. This server provides essential data exploration, quality checks, metadata operations, and performance optimization tools.
Features
Exploration and Metadata (6 tools)
- list_tables: List all tables in namespaces
- get_schema: Get detailed schema information with types and metadata
- get_partitioning: View partition specifications and transforms
- get_table_properties: Access table configuration and properties
- get_snapshots: List available snapshots for time travel
- get_table_stats: Get basic statistics (row count, file count, size)
Data Quality (6 tools)
- sample_data: Get sample rows with optional random sampling
- get_null_counts: Count null values per column with percentages
- get_distinct_counts: Count distinct values per column
- get_value_distribution: Get top N most frequent values with counts
- check_duplicates: Detect duplicate rows based on specified columns
- get_column_stats: Statistical summary for numeric columns (min, max, avg, std)
Content Analysis (4 tools)
- preview_partitions: Show existing partitions and their sizes
- search_values: Search for rows containing specific values
- get_data_types_summary: Get data type distribution
- validate_schema_evolution: Show schema evolution history
Performance Optimization (3 tools)
- get_file_stats: Get information about data files
- analyze_skew: Detect partition imbalance
- get_table_metadata: Get complete metadata for optimization
Utilities (5 tools)
- execute_query: Execute queries using pandas query syntax
- get_column_names: Get simple list of column names
- check_table_exists: Verify if a table exists
- get_latest_snapshot: Get most recent snapshot details
- filter_preview: Preview data with filters applied
Total: 24 tools
Requirements
- Python 3.12+
- PyIceberg 0.10.0+
- AWS credentials (for Glue catalog)
- Access to Iceberg catalog (Glue, REST, or Hive)
Installation
Using uv (recommended)
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository and enter it
git clone https://github.com/jaimeferj/mcp-iceberg.git mcpIceberg
cd mcpIceberg
# Install dependencies
uv sync
Using pip
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Configuration
Centralized Database Connection
This project uses a single, centralized database/catalog configuration that is shared across all tools and operations. You only need to configure ONE database connection, and all 24 MCP tools will automatically use it.
Key Features:
- ✅ Single point of configuration - no duplicate connection strings
- ✅ Consistent connections across all tools and scripts
- ✅ Easy to switch between environments (dev/staging/prod)
- ✅ Centralized error handling and connection validation
- ✅ Support for multiple catalog types (Glue, REST, Hive)
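For intuition, the shared connection can be pictured as a cached loader around PyIceberg's load_catalog(). This is a minimal sketch only; the helper name get_catalog() and the exact dispatch logic are illustrative, not the project's confirmed internals:

import os
from functools import lru_cache

from pyiceberg.catalog import Catalog, load_catalog


@lru_cache(maxsize=1)
def get_catalog() -> Catalog:
    """Build the catalog once; every tool call reuses the same instance."""
    catalog_type = os.environ.get("ICEBERG_CATALOG_TYPE", "glue")
    props = {"type": catalog_type}
    if catalog_type == "rest":
        props["uri"] = os.environ["ICEBERG_REST_URI"]
    elif catalog_type == "hive":
        props["uri"] = os.environ["ICEBERG_HIVE_URI"]
    return load_catalog("default", **props)

Caching the catalog in one place is what makes the "single point of configuration" property hold: switching environments means changing environment variables, not code.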
Environment Variables
The MCP server supports multiple Iceberg catalog types. Configure using environment variables.
Quick Start: Copy the template file and edit with your values:
cp .env.sample .env
# Edit .env with your actual credentials
Test Your Connection:
python test_connection.py
# or
uv run test_connection.py
See .env.sample for a comprehensive list of all available configuration options with detailed explanations.
Database Scope Restriction (Security Feature)
You can restrict all operations to a single database for enhanced security:
# .env
ICEBERG_DATABASE=production
When enabled:
- ✅ All operations restricted to specified database only
- ✅ Prevents accidental access to other databases
- ✅ Use simple table names: users instead of production.users
- ✅ Cross-database queries automatically blocked
Example:
# With ICEBERG_DATABASE=production
table = load_table("users") # ✓ Works - uses production.users
table = load_table("staging.users") # ✗ Blocked - DatabaseScopeError
AWS Glue Catalog (default)
export ICEBERG_CATALOG_TYPE=glue
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
REST Catalog
export ICEBERG_CATALOG_TYPE=rest
export ICEBERG_REST_URI=http://localhost:8181
Hive Metastore
export ICEBERG_CATALOG_TYPE=hive
export ICEBERG_HIVE_URI=thrift://localhost:9083
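A quick connectivity smoke test, in the spirit of test_connection.py (shown here for the REST catalog; substitute the corresponding properties for Glue or Hive):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", type="rest", uri="http://localhost:8181")

# Listing namespaces is a cheap round trip that proves the catalog is reachable.
for namespace in catalog.list_namespaces():
    print(".".join(namespace))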
Claude Desktop Configuration
Add to your Claude Desktop configuration file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "iceberg": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/mcpIceberg",
        "run",
        "main.py"
      ],
      "env": {
        "ICEBERG_CATALOG_TYPE": "glue",
        "AWS_REGION": "us-east-1",
        "AWS_ACCESS_KEY_ID": "your_access_key",
        "AWS_SECRET_ACCESS_KEY": "your_secret_key"
      }
    }
  }
}
Usage Examples
Basic Exploration
List all available tables:
> Use list_tables
Get schema for a specific table:
> Use get_schema with table_name: "my_database.my_table"
Get basic statistics:
> Use get_table_stats with table_name: "my_database.my_table"
Data Quality Checks
Sample data from a table:
> Use sample_data with table_name: "my_database.my_table", n: 20
Check for null values:
> Use get_null_counts with table_name: "my_database.my_table"
Find duplicates:
> Use check_duplicates with table_name: "my_database.my_table", columns: ["id", "email"]
Get statistical summary:
> Use get_column_stats with table_name: "my_database.my_table"
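Under the hood, checks like these typically materialize a bounded scan into pandas. A rough sketch of the null-count computation, assuming PyIceberg and pandas and a catalog configured as above (the 100,000-row cap is illustrative):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("my_database.my_table")

# Cap the scan so a quality check never pulls an entire large table.
df = table.scan(limit=100_000).to_pandas()
null_counts = df.isna().sum()
null_pct = (null_counts / len(df) * 100).round(2)
print(null_pct.to_dict())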
Advanced Analysis
Get value distribution for a column:
> Use get_value_distribution with table_name: "my_database.my_table", column: "status", top_n: 20
Search for specific values:
> Use search_values with table_name: "my_database.my_table", column: "user_id", value: "12345"
Execute custom queries:
> Use execute_query with table_name: "my_database.my_table", query: "age > 30 and status == 'active'"
Preview with filters:
> Use filter_preview with table_name: "my_database.my_table", filters: "amount > 1000", limit: 50
Performance Analysis
Check partition skew:
> Use analyze_skew with table_name: "my_database.my_table"
Get file statistics:
> Use get_file_stats with table_name: "my_database.my_table"
View partition details:
> Use preview_partitions with table_name: "my_database.my_table"
Metadata Operations
View snapshots for time travel:
> Use get_snapshots with table_name: "my_database.my_table", limit: 20
Check schema evolution:
> Use validate_schema_evolution with table_name: "my_database.my_table"
Get table properties:
> Use get_table_properties with table_name: "my_database.my_table"
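Equivalently, snapshot history can be read straight off the table metadata with PyIceberg; a sketch, assuming a catalog configured as above:

from datetime import datetime, timezone

from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("my_database.my_table")

# Each snapshot is a point-in-time version of the table usable for time travel.
for snap in table.snapshots():
    ts = datetime.fromtimestamp(snap.timestamp_ms / 1000, tz=timezone.utc)
    print(snap.snapshot_id, ts.isoformat(), snap.summary)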
Table Name Format
All tools that accept a table_name parameter require a fully qualified name in the format:
namespace.table_name
(unless ICEBERG_DATABASE is set, in which case unqualified names resolve within that database).
Examples:
- production.users
- analytics.daily_metrics
- staging.orders
Response Format
All tools return responses in JSON format. Example response from get_schema:
{
  "table": "my_database.my_table",
  "schema_id": 1,
  "column_count": 5,
  "columns": [
    {
      "name": "id",
      "type": "long",
      "nullable": false,
      "id": 1
    },
    {
      "name": "name",
      "type": "string",
      "nullable": true,
      "id": 2
    }
  ]
}
Error Handling
The server handles errors gracefully and returns informative error messages:
- Table not found: Verify the table name is fully qualified
- Connection errors: Check your catalog configuration and credentials
- Permission errors: Ensure your AWS credentials have appropriate permissions
- Query errors: Verify your query syntax (pandas query format for filters)
Query Syntax
The execute_query and filter_preview tools use pandas query syntax:
Examples:
# Numeric comparisons
"age > 30"
"amount >= 1000 and amount <= 5000"
# String comparisons
"status == 'active'"
"country in ['US', 'UK', 'CA']"
# Boolean operations
"is_verified == True and age > 18"
# Complex conditions
"(age > 30 or vip == True) and status == 'active'"
Performance Considerations
- Sampling: For large tables, use sample_data with small n values to avoid loading entire tables
- Column selection: When possible, specify columns in tools like get_null_counts to reduce data transfer
- Limits: Most tools have sensible default limits to prevent loading excessive data
- Random sampling: loads 10x the requested rows for better randomness (see the sketch below)
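The 10x over-fetch could look roughly like this; a sketch under the assumptions above, not the server's exact implementation:

import pandas as pd
from pyiceberg.catalog import load_catalog


def random_sample(table_name: str, n: int) -> pd.DataFrame:
    table = load_catalog("default").load_table(table_name)
    # Over-fetch 10x the requested rows, then sample locally: cheaper than a
    # full scan, more random than simply taking the first n rows.
    df = table.scan(limit=n * 10).to_pandas()
    return df.sample(n=min(n, len(df)))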
Logging
Logs are written to stderr and include:
- Tool invocations with parameters
- Catalog initialization status
- Error details with stack traces
View logs in the Claude Desktop developer console or check the debug output when running manually.
Development
Running Manually
# With environment variables
export ICEBERG_CATALOG_TYPE=glue
export AWS_REGION=us-east-1
uv run main.py
Testing
# Run tests
uv run pytest tests/
# Run with coverage
uv run pytest --cov=main tests/
Adding New Tools
- Add the async method to the IcebergMCPServer class
- Add the tool definition to the get_tools() method (see the definition sketch after the example below)
- The handler will automatically route calls to your method
- Return a dictionary that will be JSON-serialized
Example:
async def my_new_tool(self, table_name: str, param: str) -> Dict[str, Any]:
    """Tool description"""
    try:
        table = self._load_table(table_name)
        # Your logic here
        return {
            "table": table_name,
            "result": "value"
        }
    except Exception as e:
        logger.error(f"Error in my_new_tool: {e}")
        raise
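For step 2, MCP tool definitions carry a name, a description, and a JSON Schema under inputSchema. A hypothetical entry for the example above (field contents are illustrative):

{
    "name": "my_new_tool",
    "description": "One-line description the LLM sees when choosing tools",
    "inputSchema": {
        "type": "object",
        "properties": {
            "table_name": {"type": "string", "description": "namespace.table_name"},
            "param": {"type": "string"},
        },
        "required": ["table_name", "param"],
    },
}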
Troubleshooting
MCP server not showing in Claude Desktop
- Check the config file path and JSON syntax
- Restart Claude Desktop completely
- Check Claude Desktop logs for errors
AWS Glue connection issues
- Verify AWS credentials are valid
- Check IAM permissions (need Glue and S3 access)
- Verify the AWS region is correct
- Test credentials with the AWS CLI: aws glue get-databases
Table not found errors
- Verify the table name format: namespace.table_name
- Check that the table exists: use list_tables first
- Verify permissions to access the namespace
Performance issues
- Use smaller sample sizes
- Specify columns when possible to reduce data transfer
- Use partition filters when available
- Consider using the approximate parameter for distinct counts
Architecture
Claude Desktop/CLI
        ↓
MCP Protocol (stdio)
        ↓
IcebergMCPServer
        ↓
PyIceberg Library
        ↓
Iceberg Catalog (Glue/REST/Hive)
        ↓
Iceberg Tables (S3/HDFS/etc.)
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
This project is open source and available under the MIT License.
Support
For issues and questions:
- GitHub Issues: Create an issue
- Documentation: This README
- PyIceberg Docs: https://py.iceberg.apache.org/
Changelog
v0.1.0 (Initial Release)
- Implemented 24 tools across 5 categories
- Support for Glue, REST, and Hive catalogs
- Comprehensive error handling
- JSON response format
- Full table metadata and schema operations
- Data quality checks and sampling
- Performance optimization tools
- Content analysis and search capabilities
Acknowledgments
- Built with PyIceberg
- Uses the Model Context Protocol