Databricks PDF MCP Server
A hybrid PDF processing solution for Databricks environments that provides both a high-performance native library and MCP server capabilities. Use it as a direct library in notebooks for maximum performance, or as an MCP server for external AI assistant integration.
🚀 Features
Dual Operation Modes
- 📚 Library Mode: Direct function calls in Databricks notebooks (fastest)
- 🔌 MCP Server Mode: Standard MCP protocol for external AI assistants
Databricks Native
- Unity Catalog Integration: Seamless access to PDFs in Unity Catalog volumes
- Spark Memory Management: Automatic memory optimization using cluster resources
- Distributed Processing: Leverage Spark executors for large PDF processing
- DBFS Support: Access PDFs stored in Databricks File System
- Intelligent Caching: Session-aware caching with automatic cleanup
- Performance Optimized: Cluster-aware processing strategies
📋 Requirements
- Databricks Runtime: 13.0+ (Python 3.9+, Spark 3.4+)
- Cluster: Minimum 2 cores, 8GB RAM recommended
- Unity Catalog: Optional but recommended for secure file access
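A quick sanity check in a notebook confirms the session meets these requirements (spark is the session object Databricks injects into every notebook):
import sys
# Verify interpreter and Spark versions against the requirements above
print(f"Python: {sys.version_info.major}.{sys.version_info.minor}")  # expect 3.9+
print(f"Spark: {spark.version}")  # expect 3.4+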
🛠 Installation
In Databricks Notebook
%pip install git+https://github.com/example/databricks-pdf-mcp-server.git
As Cluster Library
- Go to your cluster configuration
- Click "Libraries" → "Install New"
- Select "PyPI" and enter:
git+https://github.com/example/databricks-pdf-mcp-server.git
Local Development (for testing)
git clone https://github.com/example/databricks-pdf-mcp-server.git
cd databricks-pdf-mcp-server
pip install -e .
🎯 Quick Start
Library Mode (Recommended for Databricks Notebooks)
from databricks_pdf_processor import create_pdf_processor
# Create processor instance
processor = create_pdf_processor()
# Extract text from Unity Catalog volume
text, stats = processor.extract_text("/Volumes/main/default/documents/report.pdf")
print(f"Extracted {stats.pages_processed} pages in {stats.processing_time_seconds:.2f}s")
print(text)
# Search within PDF
results, stats = processor.search_content(
"/Volumes/main/default/documents/manual.pdf",
query="installation instructions",
max_results=5
)
for result in results:
print(f"Page {result.page_number}: {result.match_text}")
print(f"Context: ...{result.context_before} [{result.match_text}] {result.context_after}...")
# Get PDF metadata
metadata, stats = processor.get_metadata("/Volumes/main/default/documents/report.pdf")
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Pages: {metadata.page_count}")
MCP Server Mode (For External AI Assistants)
Run as MCP Server
# In Databricks terminal or job
python -m databricks_pdf_processor
# Or with uvx
uvx databricks-pdf-mcp-server
MCP Client Configuration
{
"mcpServers": {
"databricks-pdf": {
"command": "python",
"args": ["-m", "databricks_pdf_processor"],
"env": {
"PDF_UNITY_CATALOG_ONLY": "true",
"PDF_MAX_FILE_SIZE_PCT": "15"
}
}
}
}
Available MCP Tools
- extract_text: Extract text from PDF files
- search_pdf: Search for content within PDFs
- get_pdf_metadata: Get PDF metadata and document information
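As an illustration, here is a minimal sketch of calling extract_text from an external client using the official MCP Python SDK (the mcp package); the tool argument names are assumed to mirror the library API shown above:
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio, mirroring the client configuration above
    params = StdioServerParameters(command="python", args=["-m", "databricks_pdf_processor"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Argument names assumed to mirror the library API
            result = await session.call_tool(
                "extract_text",
                {"file_path": "/Volumes/main/default/documents/report.pdf"},
            )
            print(result.content)

asyncio.run(main())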
Advanced Configuration
import os
# Configure via environment variables
os.environ["PDF_MAX_FILE_SIZE_PCT"] = "20" # 20% of driver memory
os.environ["PDF_CLUSTER_MODE"] = "auto" # auto, true, false
os.environ["PDF_UNITY_CATALOG_ONLY"] = "true" # Restrict to Unity Catalog only
os.environ["PDF_CACHE_TTL"] = "7200" # 2 hours cache
# Or use Databricks widgets for interactive configuration
dbutils.widgets.text("pdf_max_file_size", "200MB", "Max PDF File Size")
dbutils.widgets.dropdown("pdf_unity_catalog_only", "true", ["true", "false"], "Unity Catalog Only")
processor = create_pdf_processor()
Working with Encrypted PDFs
# Extract text from password-protected PDF
text, stats = processor.extract_text(
"/Volumes/main/secure/confidential.pdf",
password="secret123"
)
# Search in encrypted PDF
results, stats = processor.search_content(
"/Volumes/main/secure/confidential.pdf",
query="confidential data",
password="secret123"
)
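The hard-coded passwords above are for illustration only; in practice, read them from a Databricks secret scope (the scope and key names below are placeholders):
# Fetch the PDF password from a secret scope instead of hard-coding it
# ("pdf-secrets" and "confidential-pdf" are placeholder names)
pdf_password = dbutils.secrets.get(scope="pdf-secrets", key="confidential-pdf")
text, stats = processor.extract_text(
    "/Volumes/main/secure/confidential.pdf",
    password=pdf_password
)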
Page Range Processing
# Extract specific pages
text, stats = processor.extract_text(
"/Volumes/main/default/documents/large_report.pdf",
page_range="1-10" # First 10 pages
)
# Extract multiple ranges
text, stats = processor.extract_text(
"/Volumes/main/default/documents/report.pdf",
page_range="1-5,10-15,20" # Pages 1-5, 10-15, and 20
)
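Page range strings are comma-separated single pages and inclusive ranges. As a rough sketch of the semantics (an illustrative parser, not the library's actual implementation):
def parse_page_range(page_range: str) -> list[int]:
    """Illustrative parser for strings like "1-5,10-15,20" (1-based, inclusive)."""
    pages = []
    for part in page_range.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(parse_page_range("1-5,10-15,20"))
# [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 20]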
🏗 Unity Catalog Setup
1. Create Volume for PDF Storage
-- Create catalog and schema
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.documents;
-- Create volume for PDF files
CREATE VOLUME IF NOT EXISTS main.documents.pdf_files;
2. Grant Permissions
-- Grant read access to users
GRANT READ VOLUME ON VOLUME main.documents.pdf_files TO `user@company.com`;
GRANT READ VOLUME ON VOLUME main.documents.pdf_files TO `data-analysts`;
-- Grant read access to service principal (for jobs)
GRANT READ VOLUME ON VOLUME main.documents.pdf_files TO `service-principal-uuid`;
3. Upload PDF Files
# Upload files to Unity Catalog volume
dbutils.fs.cp(
"file:/local/path/document.pdf",
"/Volumes/main/documents/pdf_files/document.pdf"
)
# List files in volume
dbutils.fs.ls("/Volumes/main/documents/pdf_files/")
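To upload a whole directory of PDFs in one call, dbutils.fs.cp accepts recurse=True:
# Recursively copy a local directory of PDFs into the volume
dbutils.fs.cp(
    "file:/local/path/pdfs/",
    "/Volumes/main/documents/pdf_files/",
    recurse=True
)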
⚙️ Configuration Options
| Environment Variable | Default | Description |
|---|---|---|
| PDF_MAX_FILE_SIZE_PCT | 15 | Max file size as percentage of driver memory |
| PDF_CLUSTER_MODE | auto | Processing mode: auto, true, false |
| PDF_UNITY_CATALOG_ONLY | true | Restrict access to Unity Catalog volumes only |
| PDF_CACHE_TTL | 3600 | Cache expiration time in seconds |
| PDF_MAX_PAGES_PER_PARTITION | 50 | Max pages per Spark partition |
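As a worked example of PDF_MAX_FILE_SIZE_PCT (illustrative arithmetic only; the library's exact memory accounting may differ):
# With a 16 GB driver and PDF_MAX_FILE_SIZE_PCT=15, the largest
# accepted file would be roughly 16 GB * 0.15 = 2.4 GB
driver_memory_gb = 16
max_file_size_pct = 15
print(f"Approx. max file size: {driver_memory_gb * max_file_size_pct / 100:.1f} GB")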
Widget Configuration
# Set up interactive widgets in notebook
dbutils.widgets.text("pdf_max_file_size", "200MB", "Maximum PDF File Size")
dbutils.widgets.dropdown("pdf_cluster_mode", "auto", ["auto", "true", "false"], "Cluster Processing Mode")
dbutils.widgets.dropdown("pdf_unity_catalog_only", "true", ["true", "false"], "Unity Catalog Only")
dbutils.widgets.text("pdf_cache_ttl", "3600", "Cache TTL (seconds)")
# Widgets are automatically read by the processor
processor = create_pdf_processor()
🔧 Cluster Configuration
Recommended Settings
Small to Medium PDFs (< 50MB):
- Driver: 4 cores, 16GB RAM
- Workers: 2-4 workers, 2 cores, 8GB RAM each
- Runtime: DBR 13.3 LTS or later
Large PDFs (> 50MB):
- Driver: 8 cores, 32GB RAM
- Workers: 4-8 workers, 4 cores, 16GB RAM each
- Runtime: DBR 13.3 LTS or later
- Auto-scaling: Enabled
Spark Configuration
# Optimize for PDF processing
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Memory settings for large PDFs
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.executor.memoryFraction", "0.8")
spark.conf.set("spark.driver.maxResultSize", "4g")
📊 Performance Monitoring
Processing Statistics
text, stats = processor.extract_text("/Volumes/main/default/docs/large.pdf")
print(f"File size: {stats.file_size_bytes / 1024 / 1024:.1f} MB")
print(f"Processing time: {stats.processing_time_seconds:.2f} seconds")
print(f"Pages processed: {stats.pages_processed}")
print(f"Used cluster mode: {stats.used_cluster_mode}")
print(f"Cache hit: {stats.cache_hit}")
Cache Management
# Get cache statistics
cache_stats = processor.get_cache_stats()
print(f"Cache entries: {cache_stats['total_entries']}")
print(f"Cache size: {cache_stats['total_size_bytes'] / 1024 / 1024:.1f} MB")
# Clear cache if needed
processor.clear_cache()
🛡 Security Best Practices
Unity Catalog Security
# Use Unity Catalog volumes for secure access
file_path = "/Volumes/main/secure/documents/confidential.pdf"
# Enable Unity Catalog only mode
os.environ["PDF_UNITY_CATALOG_ONLY"] = "true"
Access Control
-- Create dedicated schema for PDF processing
CREATE SCHEMA IF NOT EXISTS main.pdf_processing;
-- Grant minimal required permissions
GRANT USE CATALOG ON CATALOG main TO `pdf-processing-users`;
GRANT USE SCHEMA ON SCHEMA main.pdf_processing TO `pdf-processing-users`;
GRANT READ VOLUME ON VOLUME main.pdf_processing.documents TO `pdf-processing-users`;
Audit Logging
import logging
# Enable detailed logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('databricks_pdf_processor')
# Processing operations are automatically logged
processor = create_pdf_processor()
🧪 Testing
Unit Tests
# Run tests in Databricks notebook
%pip install pytest
# Run basic tests
%run ./tests/test_databricks_pdf_processor.py
Code Quality Tests
The project includes comprehensive code quality validation tests that check formatting, linting, and type checking:
# Run all code quality tests
python scripts/run_code_quality_tests.py --all
# Run specific test types
python scripts/run_code_quality_tests.py --formatting # Black formatting
python scripts/run_code_quality_tests.py --linting # Flake8 linting
python scripts/run_code_quality_tests.py --typing # Mypy type checking
# Run with pytest directly
pytest tests/test_code_quality.py -m code_quality # All quality tests
pytest tests/test_code_quality.py -m formatting # Only formatting tests
pytest tests/test_code_quality.py -m linting # Only linting tests
pytest tests/test_code_quality.py -m typing # Only type checking tests
Code Quality Tools:
- Black: Code formatting validation with --check --diff
- Flake8: PEP 8 compliance and code quality linting
- Mypy: Static type checking for type-annotated files
Configuration Files:
- .flake8: Flake8 linting rules and exclusions
- mypy.ini: Mypy type checking configuration
- pytest.ini: Test markers and quality test integration
Performance Testing
from databricks_pdf_processor.testing import PerformanceTester
# Create performance tester
tester = PerformanceTester()
# Run performance benchmarks
results = tester.run_benchmarks([
"/Volumes/main/test/small.pdf",
"/Volumes/main/test/medium.pdf",
"/Volumes/main/test/large.pdf"
])
print(f"Average processing time: {results['avg_time']:.2f}s")
print(f"Memory usage: {results['peak_memory']:.1f}MB")
🔍 Troubleshooting
Common Issues
"DatabricksPDFProcessor can only run in Databricks environments"
- Ensure you're running in a Databricks notebook or job
- Check that Spark session is available
"Access denied to Unity Catalog path"
- Verify Unity Catalog permissions with SHOW GRANTS ON VOLUME
- Check that the volume path is correct
- Ensure user has READ VOLUME permission
"File size exceeds limit"
- Increase the PDF_MAX_FILE_SIZE_PCT environment variable
- Use a cluster with more driver memory
- Enable cluster mode for distributed processing
Memory errors with large PDFs
- Enable cluster mode: PDF_CLUSTER_MODE=true
- Increase cluster driver memory
- Process files in smaller page ranges, as shown below
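For example, a simple chunking loop keeps peak memory bounded by extracting a fixed number of pages at a time (the chunk size and the use of get_metadata for the page count are assumptions built on the API above):
# Process a large PDF in 50-page chunks to bound driver memory
path = "/Volumes/main/default/documents/large_report.pdf"
metadata, _ = processor.get_metadata(path)

chunks = []
chunk_size = 50
for start in range(1, metadata.page_count + 1, chunk_size):
    end = min(start + chunk_size - 1, metadata.page_count)
    text, stats = processor.extract_text(path, page_range=f"{start}-{end}")
    chunks.append(text)

full_text = "\n".join(chunks)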
Debug Mode
import logging
import os
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
os.environ["PDF_LOG_LEVEL"] = "DEBUG"
processor = create_pdf_processor()
Health Check
# Verify processor setup
processor = create_pdf_processor()
# Check cluster info
cluster_info = processor.databricks_utils.get_cluster_info()
print(f"Runtime version: {cluster_info['runtime_version']}")
print(f"Cluster ID: {cluster_info['cluster_id']}")
print(f"Executor count: {cluster_info['executor_count']}")
print(f"Total cores: {cluster_info['total_cores']}")
# Test Unity Catalog access
test_path = "/Volumes/main/default/"
has_access = processor.databricks_utils.check_unity_catalog_access(test_path)
print(f"Unity Catalog access: {has_access}")
📚 API Reference
DatabricksPDFProcessor
extract_text(file_path, page_range="all", password=None)
Extract text content from PDF.
Parameters:
- file_path (str): Path to PDF file (Unity Catalog or DBFS)
- page_range (str): Page range like "1-5", "1,3,5", or "all"
- password (str, optional): Password for encrypted PDFs
Returns:
Tuple[str, ProcessingStats]: Extracted text and processing statistics
search_content(file_path, query, case_sensitive=False, max_results=10, offset=0, password=None)
Search for content within PDF.
Parameters:
- file_path (str): Path to PDF file
- query (str): Search term or phrase
- case_sensitive (bool): Whether search is case-sensitive
- max_results (int): Maximum results to return
- offset (int): Number of results to skip
- password (str, optional): Password for encrypted PDFs
Returns:
Tuple[List[SearchResult], ProcessingStats]: Search results and statistics
get_metadata(file_path, password=None)
Extract PDF metadata and document information.
Parameters:
- file_path (str): Path to PDF file
- password (str, optional): Password for encrypted PDFs
Returns:
Tuple[PDFMetadata, ProcessingStats]: Metadata and processing statistics
🤝 Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make changes and add tests
- Run tests: pytest tests/
- Submit a pull request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- Documentation:
- Issues: GitHub Issues
- Discussions: GitHub Discussions
🏷 Version History
- v1.0.0: Initial release with core PDF processing capabilities
- Focus on Databricks-native implementation
- Unity Catalog and DBFS support
- Spark integration and distributed processing