Databricks PDF MCP Server
A hybrid PDF processing solution for Databricks environments that provides both a high-performance native library and MCP server capabilities. Use it as a direct library in notebooks for maximum performance, or as an MCP server for external AI assistant integration.
🚀 Features
Dual Operation Modes
- 📚 Library Mode: Direct function calls in Databricks notebooks (fastest)
- 🔌 MCP Server Mode: Standard MCP protocol for external AI assistants
Databricks Native
- Unity Catalog Integration: Seamless access to PDFs in Unity Catalog volumes
- Spark Memory Management: Automatic memory optimization using cluster resources
- Distributed Processing: Leverage Spark executors for large PDF processing
- DBFS Support: Access PDFs stored in Databricks File System
- Intelligent Caching: Session-aware caching with automatic cleanup
- Performance Optimized: Cluster-aware processing strategies
📋 Requirements
- Databricks Runtime: 13.0+ (Python 3.9+, Spark 3.4+)
- Cluster: Minimum 2 cores, 8GB RAM recommended
- Unity Catalog: Optional but recommended for secure file access
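A quick sanity check in a notebook confirms the session meets these requirements (spark is the session object Databricks injects into every notebook):
import sys
# Verify interpreter and Spark versions against the requirements above
print(f"Python: {sys.version_info.major}.{sys.version_info.minor}")  # expect 3.9+
print(f"Spark: {spark.version}")  # expect 3.4+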
🛠 Installation
In Databricks Notebook
%pip install git+https://github.com/example/databricks-pdf-mcp-server.git
As Cluster Library
- Go to your cluster configuration
- Click "Libraries" → "Install New"
- Select "PyPI" and enter:
git+https://github.com/example/databricks-pdf-mcp-server.git
Local Development (for testing)
git clone https://github.com/example/databricks-pdf-mcp-server.git
cd databricks-pdf-mcp-server
pip install -e .
🎯 Quick Start
Library Mode (Recommended for Databricks Notebooks)
from databricks_pdf_processor import create_pdf_processor
# Create processor instance
processor = create_pdf_processor()
# Extract text from Unity Catalog volume
text, stats = processor.extract_text("/Volumes/main/default/documents/report.pdf")
print(f"Extracted {stats.pages_processed} pages in {stats.processing_time_seconds:.2f}s")
print(text)
# Search within PDF
results, stats = processor.search_content(
"/Volumes/main/default/documents/manual.pdf",
query="installation instructions",
max_results=5
)
for result in results:
print(f"Page {result.page_number}: {result.match_text}")
print(f"Context: ...{result.context_before} [{result.match_text}] {result.context_after}...")
# Get PDF metadata
metadata, stats = processor.get_metadata("/Volumes/main/default/documents/report.pdf")
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Pages: {metadata.page_count}")
MCP Server Mode (For External AI Assistants)
Run as MCP Server
# In Databricks terminal or job
python -m databricks_pdf_processor
# Or with uvx
uvx databricks-pdf-mcp-server
MCP Client Configuration
{
"mcpServers": {
"databricks-pdf": {
"command": "python",
"args": ["-m", "databricks_pdf_processor"],
"env": {
"PDF_UNITY_CATALOG_ONLY": "true",
"PDF_MAX_FILE_SIZE_PCT": "15"
}
}
}
}
Available MCP Tools
- extract_text: Extract text from PDF files
- search_pdf: Search for content within PDFs
- get_pdf_metadata: Get PDF metadata and document information
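As an illustration, here is a minimal sketch of calling extract_text from an external client using the official MCP Python SDK (the mcp package); the tool argument names are assumed to mirror the library API shown above:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio, mirroring the client configuration above
    params = StdioServerParameters(command="python", args=["-m", "databricks_pdf_processor"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Argument names assumed to mirror the library API
            result = await session.call_tool(
                "extract_text",
                {"file_path": "/Volumes/main/default/documents/report.pdf"},
            )
            print(result.content)

asyncio.run(main())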
Advanced Configuration
import os
# Configure via environment variables
os.environ["PDF_MAX_FILE_SIZE_PCT"] = "20" # 20% of driver memory
os.environ["PDF_CLUSTER_MODE"] = "auto" # auto, true, false
os.environ["PDF_UNITY_CATALOG_ONLY"] = "true" # Restrict to Unity Catalog only
os.environ["PDF_CACHE_TTL"] = "7200" # 2 hours cache
# Or use Databricks widgets for interactive configuration
dbutils.widgets.text("pdf_max_file_size", "200MB", "Max PDF File Size")
dbutils.widgets.dropdown("pdf_unity_catalog_only", "true", ["true", "false"], "Unity Catalog Only")
processor = create_pdf_processor()
Working with Encrypted PDFs
# Extract text from password-protected PDF
text, stats = processor.extract_text(
"/Volumes/main/secure/confidential.pdf",
password="secret123"
)
# Search in encrypted PDF
results, stats = processor.search_content(
"/Volumes/main/secure/confidential.pdf",
query="confidential data",
password="secret123"
)
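The hard-coded passwords above are for illustration only; in practice, read them from a Databricks secret scope (the scope and key names below are placeholders):
# Fetch the PDF password from a secret scope instead of hard-coding it
# ("pdf-secrets" and "confidential-pdf" are placeholder names)
pdf_password = dbutils.secrets.get(scope="pdf-secrets", key="confidential-pdf")
text, stats = processor.extract_text(
    "/Volumes/main/secure/confidential.pdf",
    password=pdf_password
)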
Page Range Processing
# Extract specific pages
text, stats = processor.extract_text(
"/Volumes/main/default/documents/large_report.pdf",
page_range="1-10" # First 10 pages
)
# Extract multiple ranges
text, stats = processor.extract_text(
"/Volumes/main/default/documents/report.pdf",
page_range="1-5,10-15,20" # Pages 1-5, 10-15, and 20
)
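Page range strings are comma-separated single pages and inclusive ranges. As a rough sketch of the semantics (an illustrative parser, not the library's actual implementation):
def parse_page_range(page_range: str) -> list[int]:
    """Illustrative parser for strings like "1-5,10-15,20" (1-based, inclusive)."""
    pages = []
    for part in page_range.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(parse_page_range("1-5,10-15,20"))
# [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 20]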
🏗 Unity Catalog Setup
1. Create Volume for PDF Storage
-- Create catalog and schema
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.documents;
-- Create volume for PDF files
CREATE VOLUME IF NOT EXISTS main.documents.pdf_files;
2. Grant Permissions
-- Grant read access to users
GRANT READ VOLUME ON VOLUME main.documents.pdf_files TO `user@company.com`;
GRANT READ VOLUME ON VOLUME main.documents.pdf_files TO `data-analysts`;
-- Grant read access to service principal (for jobs)
GRANT READ VOLUME ON VOLUME main.documents.pdf_files TO `service-principal-uuid`;
3. Upload PDF Files
# Upload files to Unity Catalog volume
dbutils.fs.cp(
"file:/local/path/document.pdf",
"/Volumes/main/documents/pdf_files/document.pdf"
)
# List files in volume
dbutils.fs.ls("/Volumes/main/documents/pdf_files/")
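To upload a whole directory of PDFs in one call, dbutils.fs.cp accepts recurse=True:
# Recursively copy a local directory of PDFs into the volume
dbutils.fs.cp(
    "file:/local/path/pdfs/",
    "/Volumes/main/documents/pdf_files/",
    recurse=True
)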
⚙️ Configuration Options
| Environment Variable | Default | Description |
|---|---|---|
| PDF_MAX_FILE_SIZE_PCT | 15 | Max file size as percentage of driver memory |
| PDF_CLUSTER_MODE | auto | Processing mode: auto, true, false |
| PDF_UNITY_CATALOG_ONLY | true | Restrict access to Unity Catalog volumes only |
| PDF_CACHE_TTL | 3600 | Cache expiration time in seconds |
| PDF_MAX_PAGES_PER_PARTITION | 50 | Max pages per Spark partition |
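As a worked example of PDF_MAX_FILE_SIZE_PCT (illustrative arithmetic only; the library's exact memory accounting may differ):
# With a 16 GB driver and PDF_MAX_FILE_SIZE_PCT=15, the largest
# accepted file would be roughly 16 GB * 0.15 = 2.4 GB
driver_memory_gb = 16
max_file_size_pct = 15
print(f"Approx. max file size: {driver_memory_gb * max_file_size_pct / 100:.1f} GB")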
Widget Configuration
# Set up interactive widgets in notebook
dbutils.widgets.text("pdf_max_file_size", "200MB", "Maximum PDF File Size")
dbutils.widgets.dropdown("pdf_cluster_mode", "auto", ["auto", "true", "false"], "Cluster Processing Mode")
dbutils.widgets.dropdown("pdf_unity_catalog_only", "true", ["true", "false"], "Unity Catalog Only")
dbutils.widgets.text("pdf_cache_ttl", "3600", "Cache TTL (seconds)")
# Widgets are automatically read by the processor
processor = create_pdf_processor()
🔧 Cluster Configuration
Recommended Settings
Small to Medium PDFs (< 50MB):
- Driver: 4 cores, 16GB RAM
- Workers: 2-4 workers, 2 cores, 8GB RAM each
- Runtime: DBR 13.3 LTS or later
Large PDFs (> 50MB):
- Driver: 8 cores, 32GB RAM
- Workers: 4-8 workers, 4 cores, 16GB RAM each
- Runtime: DBR 13.3 LTS or later
- Auto-scaling: Enabled
Spark Configuration
# Optimize for PDF processing
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Memory settings for large PDFs
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.executor.memoryFraction", "0.8")
spark.conf.set("spark.driver.maxResultSize", "4g")
📊 Performance Monitoring
Processing Statistics
text, stats = processor.extract_text("/Volumes/main/default/docs/large.pdf")
print(f"File size: {stats.file_size_bytes / 1024 / 1024:.1f} MB")
print(f"Processing time: {stats.processing_time_seconds:.2f} seconds")
print(f"Pages processed: {stats.pages_processed}")
print(f"Used cluster mode: {stats.used_cluster_mode}")
print(f"Cache hit: {stats.cache_hit}")
Cache Management
# Get cache statistics
cache_stats = processor.get_cache_stats()
print(f"Cache entries: {cache_stats['total_entries']}")
print(f"Cache size: {cache_stats['total_size_bytes'] / 1024 / 1024:.1f} MB")
# Clear cache if needed
processor.clear_cache()
🛡 Security Best Practices
Unity Catalog Security
# Use Unity Catalog volumes for secure access
file_path = "/Volumes/main/secure/documents/confidential.pdf"
# Enable Unity Catalog only mode
os.environ["PDF_UNITY_CATALOG_ONLY"] = "true"
Access Control
-- Create dedicated schema for PDF processing
CREATE SCHEMA IF NOT EXISTS main.pdf_processing;
-- Grant minimal required permissions
GRANT USE CATALOG ON CATALOG main TO `pdf-processing-users`;
GRANT USE SCHEMA ON SCHEMA main.pdf_processing TO `pdf-processing-users`;
GRANT READ VOLUME ON VOLUME main.pdf_processing.documents TO `pdf-processing-users`;
Audit Logging
import logging
# Enable detailed logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('databricks_pdf_processor')
# Processing operations are automatically logged
processor = create_pdf_processor()
🧪 Testing
Unit Tests
# Run tests in Databricks notebook
%pip install pytest
# Run basic tests
%run ./tests/test_databricks_pdf_processor.py
Code Quality Tests
The project includes comprehensive code quality validation tests that check formatting, linting, and type checking:
# Run all code quality tests
python scripts/run_code_quality_tests.py --all
# Run specific test types
python scripts/run_code_quality_tests.py --formatting # Black formatting
python scripts/run_code_quality_tests.py --linting # Flake8 linting
python scripts/run_code_quality_tests.py --typing # Mypy type checking
# Run with pytest directly
pytest tests/test_code_quality.py -m code_quality # All quality tests
pytest tests/test_code_quality.py -m formatting # Only formatting tests
pytest tests/test_code_quality.py -m linting # Only linting tests
pytest tests/test_code_quality.py -m typing # Only type checking tests
Code Quality Tools:
- Black: Code formatting validation with --check --diff
- Flake8: PEP 8 compliance and code quality linting
- Mypy: Static type checking for type-annotated files
Configuration Files:
- .flake8: Flake8 linting rules and exclusions
- mypy.ini: Mypy type checking configuration
- pytest.ini: Test markers and quality test integration
Performance Testing
from databricks_pdf_processor.testing import PerformanceTester
# Create performance tester
tester = PerformanceTester()
# Run performance benchmarks
results = tester.run_benchmarks([
"/Volumes/main/test/small.pdf",
"/Volumes/main/test/medium.pdf",
"/Volumes/main/test/large.pdf"
])
print(f"Average processing time: {results['avg_time']:.2f}s")
print(f"Memory usage: {results['peak_memory']:.1f}MB")
🔍 Troubleshooting
Common Issues
"DatabricksPDFProcessor can only run in Databricks environments"
- Ensure you're running in a Databricks notebook or job
- Check that Spark session is available
"Access denied to Unity Catalog path"
- Verify Unity Catalog permissions with SHOW GRANTS ON VOLUME
- Check that the volume path is correct
- Ensure user has READ VOLUME permission
"File size exceeds limit"
- Increase the PDF_MAX_FILE_SIZE_PCT environment variable
- Use a cluster with more driver memory
- Enable cluster mode for distributed processing
Memory errors with large PDFs
- Enable cluster mode: PDF_CLUSTER_MODE=true
- Increase cluster driver memory
- Process files in smaller page ranges, as shown below
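For example, a simple chunking loop keeps peak memory bounded by extracting a fixed number of pages at a time (the chunk size and the use of get_metadata for the page count are assumptions built on the API above):
# Process a large PDF in 50-page chunks to bound driver memory
path = "/Volumes/main/default/documents/large_report.pdf"
metadata, _ = processor.get_metadata(path)

chunks = []
chunk_size = 50
for start in range(1, metadata.page_count + 1, chunk_size):
    end = min(start + chunk_size - 1, metadata.page_count)
    text, stats = processor.extract_text(path, page_range=f"{start}-{end}")
    chunks.append(text)

full_text = "\n".join(chunks)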
Debug Mode
import logging
import os
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
os.environ["PDF_LOG_LEVEL"] = "DEBUG"
processor = create_pdf_processor()
Health Check
# Verify processor setup
processor = create_pdf_processor()
# Check cluster info
cluster_info = processor.databricks_utils.get_cluster_info()
print(f"Runtime version: {cluster_info['runtime_version']}")
print(f"Cluster ID: {cluster_info['cluster_id']}")
print(f"Executor count: {cluster_info['executor_count']}")
print(f"Total cores: {cluster_info['total_cores']}")
# Test Unity Catalog access
test_path = "/Volumes/main/default/"
has_access = processor.databricks_utils.check_unity_catalog_access(test_path)
print(f"Unity Catalog access: {has_access}")
📚 API Reference
DatabricksPDFProcessor
extract_text(file_path, page_range="all", password=None)
Extract text content from PDF.
Parameters:
- file_path (str): Path to PDF file (Unity Catalog or DBFS)
- page_range (str): Page range like "1-5", "1,3,5", or "all"
- password (str, optional): Password for encrypted PDFs
Returns:
Tuple[str, ProcessingStats]: Extracted text and processing statistics
search_content(file_path, query, case_sensitive=False, max_results=10, offset=0, password=None)
Search for content within PDF.
Parameters:
- file_path (str): Path to PDF file
- query (str): Search term or phrase
- case_sensitive (bool): Whether search is case-sensitive
- max_results (int): Maximum results to return
- offset (int): Number of results to skip
- password (str, optional): Password for encrypted PDFs
Returns:
Tuple[List[SearchResult], ProcessingStats]: Search results and statistics
get_metadata(file_path, password=None)
Extract PDF metadata and document information.
Parameters:
- file_path (str): Path to PDF file
- password (str, optional): Password for encrypted PDFs
Returns:
Tuple[PDFMetadata, ProcessingStats]: Metadata and processing statistics
🤝 Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make changes and add tests
- Run tests: pytest tests/
- Submit a pull request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- Documentation:
- Issues: GitHub Issues
- Discussions: GitHub Discussions
🏷 Version History
- v1.0.0: Initial release with core PDF processing capabilities
- Focus on Databricks-native implementation
- Unity Catalog and DBFS support
- Spark integration and distributed processing