saleemh/doc-ingestor
doc-ingestor-mcp
An MCP (Model Context Protocol) server that provides intelligent document ingestion capabilities using the Docling toolkit. Convert any document (PDF, DOCX, images, HTML, etc.) into clean Markdown for AI processing and RAG pipelines.
Features
- Universal File Support: PDFs, DOCX/XLSX/PPTX, images (PNG/JPEG/TIFF/BMP/WEBP), HTML, Markdown, CSV, audio files, and more
- Flexible Input: Process local files or remote URLs
- Multiple Processing Pipelines: Standard (fast, high-quality), VLM (vision-language models), ASR (audio transcription)
- Intelligent Auto-Detection: Automatically selects optimal settings based on file type and content
- Queue Management: Handles concurrent requests with proper job queuing
- Mac M2 Optimized: Efficient memory usage and MLX acceleration support
- Clean Markdown Output: High-quality structured text ready for AI consumption
Installation
Prerequisites
- Python 3.9+ (recommended: 3.11+)
- macOS (optimized for Apple Silicon M2)
- 8GB+ RAM recommended
Setup
- Clone and install dependencies:
```bash
git clone <repository-url>
cd doc-ingestor-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Install Docling with Mac optimizations:
```bash
# Core Docling with MLX acceleration for Apple Silicon
pip install docling

# For MLX support (Apple Silicon only; quoted so zsh doesn't expand the brackets):
pip install "docling[mlx]"

# Optional: additional OCR engines
pip install easyocr
brew install tesseract   # Tesseract via Homebrew
```
- Start the MCP server:
```bash
python -m doc_ingestor_mcp
```
The server will start and listen for MCP connections using stdio transport.
MCP Tools
The server provides the following MCP tools:
convert_document
Converts any supported document to Markdown.
Parameters:
- `source` (required): File path or URL to the document
- `pipeline` (optional): Processing pipeline: `"standard"`, `"vlm"`, or `"asr"`
- `options` (optional): Additional processing options
Example:
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "https://arxiv.org/pdf/2408.09869",
    "pipeline": "standard"
  }
}
```
Response:
```json
{
  "content": [
    {
      "type": "text",
      "text": "# Document Title\n\nConverted markdown content here..."
    }
  ]
}
```
convert_document_advanced
Advanced conversion with detailed configuration options.
Parameters:
- `source` (required): File path or URL
- `pipeline` (optional): `"standard"`, `"vlm"`, or `"asr"`
- `ocr_enabled` (optional): Enable/disable OCR (default: auto-detect)
- `ocr_language` (optional): OCR language codes (e.g., "eng,spa")
- `table_mode` (optional): `"fast"` or `"accurate"`
- `pdf_backend` (optional): `"dlparse_v4"` or `"pypdfium2"`
- `enable_enrichments` (optional): Enable code/formula/picture enrichments
Example:
```json
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-document.pdf",
    "pipeline": "standard",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
```
get_processing_status
Check the status of ongoing conversions (useful for large files).
Parameters:
- `job_id` (required): Job identifier returned from conversion requests
list_supported_formats
Returns all supported input and output formats.
Response:
```json
{
  "input_formats": ["pdf", "docx", "xlsx", "pptx", "png", "jpeg", "html", "md", "csv", "mp3", "wav"],
  "output_formats": ["markdown", "html", "json", "text", "doctags"],
  "pipelines": ["standard", "vlm", "asr"]
}
```
Usage Examples
Basic PDF Conversion
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "./research-paper.pdf"
  }
}
```
URL-based Conversion with VLM Pipeline
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "https://example.com/complex-document.pdf",
    "pipeline": "vlm"
  }
}
```
Audio Transcription
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "./meeting-recording.mp3",
    "pipeline": "asr"
  }
}
```
Scanned Document with OCR
```json
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-invoice.pdf",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
```
Pipeline Selection Guide
Standard Pipeline (Default)
- Best for: Born-digital PDFs, Office documents, clean layouts
- Features: Advanced layout analysis, table structure recovery, optional OCR
- Performance: Fast, memory-efficient
- Use when: Document has programmatic text and standard layouts
VLM Pipeline
- Best for: Complex layouts, handwritten notes, screenshots, scanned documents
- Features: Vision-language model processing, end-to-end page understanding
- Performance: Slower, higher memory usage, MLX-accelerated on M2
- Use when: Standard pipeline fails or document has unusual layouts
ASR Pipeline
- Best for: Audio files (meetings, lectures, interviews)
- Features: Whisper-based transcription, multiple model sizes
- Performance: CPU/GPU intensive depending on model size
- Use when: Processing audio content
Auto-Detection Logic
The server automatically selects optimal settings:
- File Type Detection: Based on extension and content analysis
- OCR Decision: Enabled for scanned PDFs and images, disabled for text-based documents
- Pipeline Selection: Standard for most documents, VLM suggested for images and complex layouts
- Backend Selection: Native parser (dlparse_v4) for quality, pypdfium2 for speed/compatibility
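These heuristics can be sketched in Python. The extension sets and the `choose_pipeline` helper below are illustrative assumptions, not the server's actual detection code:

```python
from pathlib import Path

# Illustrative extension sets -- assumptions, not the server's actual lists.
AUDIO_EXTS = {".mp3", ".wav"}
IMAGE_EXTS = {".png", ".jpeg", ".jpg", ".tiff", ".bmp", ".webp"}

def choose_pipeline(source: str, has_text_layer: bool = True) -> str:
    """Pick a processing pipeline from the file type (hypothetical helper)."""
    ext = Path(source).suffix.lower()
    if ext in AUDIO_EXTS:
        return "asr"       # audio always goes to transcription
    if ext in IMAGE_EXTS or not has_text_layer:
        return "vlm"       # images and scans benefit from vision models
    return "standard"      # born-digital documents take the fast path

print(choose_pipeline("meeting.mp3"))  # -> asr
```

The `has_text_layer` flag stands in for the content analysis step (detecting whether a PDF has programmatic text).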
Performance Optimization (Mac M2)
Memory Management
- Large Files: Automatic chunking and streaming processing
- Queue System: Prevents memory overflow from concurrent requests
- Cleanup: Automatic temporary file cleanup after processing
MLX Acceleration
- VLM models run with MLX optimization on Apple Silicon
- Reduced memory footprint compared to standard PyTorch
- Automatic fallback to CPU if MLX unavailable
Configuration
```bash
# Environment variables for optimization
export DOCLING_MAX_MEMORY_GB=6   # Limit memory usage
export DOCLING_QUEUE_SIZE=3      # Max concurrent jobs
export DOCLING_ENABLE_MLX=true   # Enable MLX acceleration
```
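Reading these variables in Python might look like the sketch below; the `load_tuning` helper and its fallback defaults are assumptions, not part of the server's code:

```python
import os

def load_tuning() -> dict:
    """Read the tuning knobs above, falling back to assumed defaults."""
    return {
        "max_memory_gb": float(os.environ.get("DOCLING_MAX_MEMORY_GB", "6")),
        "queue_size": int(os.environ.get("DOCLING_QUEUE_SIZE", "3")),
        "enable_mlx": os.environ.get("DOCLING_ENABLE_MLX", "true").lower() == "true",
    }

print(load_tuning())
```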
Error Handling
Automatic Retry Logic
- Network timeouts for URL-based files
- Fallback pipelines if primary fails
- Alternative OCR engines if primary fails
Error Response Format
```json
{
  "error": {
    "type": "ConversionError",
    "message": "Failed to process document",
    "details": "Specific error information",
    "suggestions": ["Try VLM pipeline", "Enable OCR"]
  }
}
```
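A client can act on the `suggestions` array to drive a retry. The mapping below is an illustrative sketch under the assumption that suggestion strings match those shown above; it is not part of the server's API:

```python
def apply_suggestion(arguments: dict, suggestion: str) -> dict:
    """Translate an error suggestion into new request arguments (hypothetical)."""
    updated = dict(arguments)
    if suggestion == "Try VLM pipeline":
        updated["pipeline"] = "vlm"
    elif suggestion == "Enable OCR":
        updated["ocr_enabled"] = True
    return updated

# Example: fold every suggestion from an error response into the next request
error = {"suggestions": ["Try VLM pipeline", "Enable OCR"]}
args = {"source": "./scan.pdf", "pipeline": "standard"}
for s in error["suggestions"]:
    args = apply_suggestion(args, s)
print(args)
```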
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| Memory error with large PDF | Insufficient RAM | Split the document or reduce queue size |
| Poor OCR quality | Wrong language/engine | Specify language with `ocr_language` |
| Scrambled text order | PDF parsing issues | Try `"pdf_backend": "pypdfium2"` |
| Tables not detected | Layout complexity | Use `"table_mode": "accurate"` |
| Slow processing | Large/complex document | Try `"pipeline": "standard"` first |
Integration Examples
Claude Desktop MCP Configuration
Add this to your Claude Desktop configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "doc-ingestor": {
      "command": "python",
      "args": ["-m", "doc_ingestor_mcp"],
      "cwd": "/path/to/doc-ingestor-mcp"
    }
  }
}
```
Testing the Installation
- Test basic functionality:
```bash
# Start the server in debug mode
python -m doc_ingestor_mcp --debug

# In another terminal, test with a sample file
echo '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "convert_document", "arguments": {"source": "test.pdf"}}}' | python -m doc_ingestor_mcp
```
- Test with Claude Desktop:
- Restart Claude Desktop after adding the MCP configuration
- In a new conversation, try: "Can you convert this PDF to markdown?" and attach a PDF file
- The server should appear in Claude's available tools
- Test different file types:
```bash
# Test with different pipelines
python test_server.py
```
Create `test_server.py`:
```python
import asyncio

from doc_ingestor_mcp.server import DocIngestorMCPServer
from doc_ingestor_mcp.config import load_config


async def test_conversion():
    config = load_config("config.yaml")
    server = DocIngestorMCPServer(config)

    # Test basic conversion
    result = await server._handle_convert_document({
        "source": "https://arxiv.org/pdf/2408.09869",
        "pipeline": "standard"
    })

    print("Conversion successful!")
    print(f"Output length: {len(result[0].text)} characters")


if __name__ == "__main__":
    asyncio.run(test_conversion())
```
File Size Limits
- PDFs: Up to 500MB (auto-chunked)
- Images: Up to 50MB per image
- Audio: Up to 2GB (processed in segments)
- Office Docs: Up to 200MB
- URLs: 10-minute timeout for downloads
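A client-side pre-check against these limits might look like the following sketch; the `within_limit` helper and the category mapping are assumptions for illustration, derived from the ceilings listed above:

```python
# Size ceilings per category, in MB (taken from the limits listed above).
LIMITS_MB = {"pdf": 500, "image": 50, "audio": 2048, "office": 200}

# Map common extensions to a limit category (illustrative, not exhaustive).
CATEGORY = {
    "pdf": "pdf",
    "png": "image", "jpeg": "image",
    "mp3": "audio", "wav": "audio",
    "docx": "office", "xlsx": "office", "pptx": "office",
}

def within_limit(ext: str, size_mb: float) -> bool:
    """Check a file against the documented ceiling (hypothetical helper)."""
    category = CATEGORY.get(ext.lower())
    if category is None:
        return True  # no documented limit for this type
    return size_mb <= LIMITS_MB[category]

print(within_limit("pdf", 600))  # -> False
```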
Security Considerations
- Local Processing: All processing happens locally by default
- Remote Services: Optional (disabled by default) for VLM APIs
- File Cleanup: Temporary files automatically deleted
- URL Validation: Safe URL patterns enforced
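As an illustration, a minimal "safe URL" check could look like this sketch; the actual policy the server enforces is not specified here:

```python
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Allow only plain http(s) URLs with a hostname (illustrative policy)."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.hostname)

print(is_safe_url("https://arxiv.org/pdf/2408.09869"))  # -> True
print(is_safe_url("file:///etc/passwd"))                # -> False
```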
Troubleshooting
Debug Mode
```bash
python -m doc_ingestor_mcp --debug
```
Log Analysis
```bash
tail -f ./logs/server.log
```
Run Test Suite
```bash
python test_server.py
```
Common Issues
"ModuleNotFoundError: No module named 'docling'"
```bash
pip install docling
```
"MLX not available" warnings
- This is normal on non-Apple Silicon Macs
- MLX acceleration is optional; the server falls back to CPU automatically
"Queue is full" errors
- Wait for current jobs to complete
- Increase `max_queue_size` in `config.yaml`
"Download failed" for URLs
- Check internet connection
- Verify URL is accessible
- Some sites may block automated downloads
Memory errors with large files
- Reduce `max_memory_gb` in `config.yaml`
- Try smaller files first
- Use `pipeline: "standard"` instead of `vlm`
OCR not working
- Install tesseract: `brew install tesseract`
- Install easyocr: `pip install easyocr`
- Check language settings in `config.yaml`
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Docling Project Docs