saleemh/doc-ingestor
doc-ingestor-mcp
An MCP (Model Context Protocol) server that provides intelligent document ingestion capabilities using the Docling toolkit. Convert any document (PDF, DOCX, images, HTML, etc.) into clean Markdown for AI processing and RAG pipelines.
Features
- Universal File Support: PDFs, DOCX/XLSX/PPTX, images (PNG/JPEG/TIFF/BMP/WEBP), HTML, Markdown, CSV, audio files, and more
- Flexible Input: Process local files or remote URLs
- Multiple Processing Pipelines: Standard (fast, high-quality), VLM (vision-language models), ASR (audio transcription)
- Intelligent Auto-Detection: Automatically selects optimal settings based on file type and content
- Queue Management: Handles concurrent requests with proper job queuing
- Mac M2 Optimized: Efficient memory usage and MLX acceleration support
- Clean Markdown Output: High-quality structured text ready for AI consumption
Installation
Prerequisites
- Python 3.9+ (recommended: 3.11+)
- macOS (optimized for Apple Silicon M2)
- 8GB+ RAM recommended
Setup
- Clone and install dependencies:
```bash
git clone <repository-url>
cd doc-ingestor-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Install Docling with Mac optimizations:
```bash
# Core Docling with MLX acceleration for Apple Silicon
pip install docling

# For MLX support (Apple Silicon only; quoted so zsh doesn't expand the brackets):
pip install "docling[mlx]"

# Optional: additional OCR engines
pip install easyocr
brew install tesseract   # Tesseract via Homebrew
```
- Start the MCP server:
```bash
python -m doc_ingestor_mcp
```
The server will start and listen for MCP connections using stdio transport.
MCP Tools
The server provides the following MCP tools:
convert_document
Converts any supported document to Markdown.
Parameters:
- `source` (required): File path or URL to the document
- `pipeline` (optional): Processing pipeline: `"standard"`, `"vlm"`, or `"asr"`
- `options` (optional): Additional processing options
Example:
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "https://arxiv.org/pdf/2408.09869",
    "pipeline": "standard"
  }
}
```
Response:
```json
{
  "content": [
    {
      "type": "text",
      "text": "# Document Title\n\nConverted markdown content here..."
    }
  ]
}
```
convert_document_advanced
Advanced conversion with detailed configuration options.
Parameters:
- `source` (required): File path or URL
- `pipeline` (optional): `"standard"`, `"vlm"`, or `"asr"`
- `ocr_enabled` (optional): Enable/disable OCR (default: auto-detect)
- `ocr_language` (optional): OCR language codes (e.g., "eng,spa")
- `table_mode` (optional): `"fast"` or `"accurate"`
- `pdf_backend` (optional): `"dlparse_v4"` or `"pypdfium2"`
- `enable_enrichments` (optional): Enable code/formula/picture enrichments
Example:
```json
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-document.pdf",
    "pipeline": "standard",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
```
get_processing_status
Check the status of ongoing conversions (useful for large files).
Parameters:
- `job_id` (required): Job identifier returned from conversion requests
list_supported_formats
Returns all supported input and output formats.
Response:
```json
{
  "input_formats": ["pdf", "docx", "xlsx", "pptx", "png", "jpeg", "html", "md", "csv", "mp3", "wav"],
  "output_formats": ["markdown", "html", "json", "text", "doctags"],
  "pipelines": ["standard", "vlm", "asr"]
}
```
Usage Examples
Basic PDF Conversion
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "./research-paper.pdf"
  }
}
```
URL-based Conversion with VLM Pipeline
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "https://example.com/complex-document.pdf",
    "pipeline": "vlm"
  }
}
```
Audio Transcription
```json
{
  "name": "convert_document",
  "arguments": {
    "source": "./meeting-recording.mp3",
    "pipeline": "asr"
  }
}
```
Scanned Document with OCR
```json
{
  "name": "convert_document_advanced",
  "arguments": {
    "source": "./scanned-invoice.pdf",
    "ocr_enabled": true,
    "ocr_language": "eng",
    "table_mode": "accurate"
  }
}
```
Pipeline Selection Guide
Standard Pipeline (Default)
- Best for: Born-digital PDFs, Office documents, clean layouts
- Features: Advanced layout analysis, table structure recovery, optional OCR
- Performance: Fast, memory-efficient
- Use when: Document has programmatic text and standard layouts
VLM Pipeline
- Best for: Complex layouts, handwritten notes, screenshots, scanned documents
- Features: Vision-language model processing, end-to-end page understanding
- Performance: Slower, higher memory usage, MLX-accelerated on M2
- Use when: Standard pipeline fails or document has unusual layouts
ASR Pipeline
- Best for: Audio files (meetings, lectures, interviews)
- Features: Whisper-based transcription, multiple model sizes
- Performance: CPU/GPU intensive depending on model size
- Use when: Processing audio content
Auto-Detection Logic
The server automatically selects optimal settings:
- File Type Detection: Based on extension and content analysis
- OCR Decision: Enabled for scanned PDFs and images, disabled for text-based documents
- Pipeline Selection: Standard for most documents, VLM suggested for images and complex layouts
- Backend Selection: Native parser (dlparse_v4) for quality, pypdfium2 for speed/compatibility
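These heuristics can be sketched in Python. The extension sets and the `choose_pipeline` helper below are illustrative assumptions, not the server's actual detection code:

```python
from pathlib import Path

# Illustrative extension sets -- assumptions, not the server's actual lists.
AUDIO_EXTS = {".mp3", ".wav"}
IMAGE_EXTS = {".png", ".jpeg", ".jpg", ".tiff", ".bmp", ".webp"}

def choose_pipeline(source: str, has_text_layer: bool = True) -> str:
    """Pick a processing pipeline from the file type (hypothetical helper)."""
    ext = Path(source).suffix.lower()
    if ext in AUDIO_EXTS:
        return "asr"       # audio always goes to transcription
    if ext in IMAGE_EXTS or not has_text_layer:
        return "vlm"       # images and scans benefit from vision models
    return "standard"      # born-digital documents take the fast path

print(choose_pipeline("meeting.mp3"))  # -> asr
```

The `has_text_layer` flag stands in for the content analysis step (detecting whether a PDF has programmatic text).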
Performance Optimization (Mac M2)
Memory Management
- Large Files: Automatic chunking and streaming processing
- Queue System: Prevents memory overflow from concurrent requests
- Cleanup: Automatic temporary file cleanup after processing
MLX Acceleration
- VLM models run with MLX optimization on Apple Silicon
- Reduced memory footprint compared to standard PyTorch
- Automatic fallback to CPU if MLX unavailable
Configuration
```bash
# Environment variables for optimization
export DOCLING_MAX_MEMORY_GB=6   # Limit memory usage
export DOCLING_QUEUE_SIZE=3      # Max concurrent jobs
export DOCLING_ENABLE_MLX=true   # Enable MLX acceleration
```
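Reading these variables in Python might look like the sketch below; the `load_tuning` helper and its fallback defaults are assumptions, not part of the server's code:

```python
import os

def load_tuning() -> dict:
    """Read the tuning knobs above, falling back to assumed defaults."""
    return {
        "max_memory_gb": float(os.environ.get("DOCLING_MAX_MEMORY_GB", "6")),
        "queue_size": int(os.environ.get("DOCLING_QUEUE_SIZE", "3")),
        "enable_mlx": os.environ.get("DOCLING_ENABLE_MLX", "true").lower() == "true",
    }

print(load_tuning())
```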
Error Handling
Automatic Retry Logic
- Network timeouts for URL-based files
- Fallback pipelines if primary fails
- Alternative OCR engines if primary fails
Error Response Format
```json
{
  "error": {
    "type": "ConversionError",
    "message": "Failed to process document",
    "details": "Specific error information",
    "suggestions": ["Try VLM pipeline", "Enable OCR"]
  }
}
```
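A client can act on the `suggestions` array to drive a retry. The mapping below is an illustrative sketch under the assumption that suggestion strings match those shown above; it is not part of the server's API:

```python
def apply_suggestion(arguments: dict, suggestion: str) -> dict:
    """Translate an error suggestion into new request arguments (hypothetical)."""
    updated = dict(arguments)
    if suggestion == "Try VLM pipeline":
        updated["pipeline"] = "vlm"
    elif suggestion == "Enable OCR":
        updated["ocr_enabled"] = True
    return updated

# Example: fold every suggestion from an error response into the next request
error = {"suggestions": ["Try VLM pipeline", "Enable OCR"]}
args = {"source": "./scan.pdf", "pipeline": "standard"}
for s in error["suggestions"]:
    args = apply_suggestion(args, s)
print(args)
```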
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| Memory error with large PDF | Insufficient RAM | Split the document or reduce queue size |
| Poor OCR quality | Wrong language/engine | Specify language with `ocr_language` |
| Scrambled text order | PDF parsing issues | Try `"pdf_backend": "pypdfium2"` |
| Tables not detected | Layout complexity | Use `"table_mode": "accurate"` |
| Slow processing | Large/complex document | Try `"pipeline": "standard"` first |
Integration Examples
Claude Desktop MCP Configuration
Add this to your Claude Desktop configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "doc-ingestor": {
      "command": "python",
      "args": ["-m", "doc_ingestor_mcp"],
      "cwd": "/path/to/doc-ingestor-mcp"
    }
  }
}
```
Testing the Installation
- Test basic functionality:
```bash
# Start the server in debug mode
python -m doc_ingestor_mcp --debug

# In another terminal, test with a sample file
echo '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "convert_document", "arguments": {"source": "test.pdf"}}}' | python -m doc_ingestor_mcp
```
- Test with Claude Desktop:
- Restart Claude Desktop after adding the MCP configuration
- In a new conversation, try: "Can you convert this PDF to markdown?" and attach a PDF file
- The server should appear in Claude's available tools
- Test different file types:
```bash
# Test with different pipelines
python test_server.py
```
Create `test_server.py`:
```python
import asyncio

from doc_ingestor_mcp.server import DocIngestorMCPServer
from doc_ingestor_mcp.config import load_config


async def test_conversion():
    config = load_config("config.yaml")
    server = DocIngestorMCPServer(config)

    # Test basic conversion
    result = await server._handle_convert_document({
        "source": "https://arxiv.org/pdf/2408.09869",
        "pipeline": "standard"
    })

    print("Conversion successful!")
    print(f"Output length: {len(result[0].text)} characters")


if __name__ == "__main__":
    asyncio.run(test_conversion())
```
File Size Limits
- PDFs: Up to 500MB (auto-chunked)
- Images: Up to 50MB per image
- Audio: Up to 2GB (processed in segments)
- Office Docs: Up to 200MB
- URLs: 10-minute timeout for downloads
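A client-side pre-check against these limits might look like the following sketch; the `within_limit` helper and the category mapping are assumptions for illustration, derived from the ceilings listed above:

```python
# Size ceilings per category, in MB (taken from the limits listed above).
LIMITS_MB = {"pdf": 500, "image": 50, "audio": 2048, "office": 200}

# Map common extensions to a limit category (illustrative, not exhaustive).
CATEGORY = {
    "pdf": "pdf",
    "png": "image", "jpeg": "image",
    "mp3": "audio", "wav": "audio",
    "docx": "office", "xlsx": "office", "pptx": "office",
}

def within_limit(ext: str, size_mb: float) -> bool:
    """Check a file against the documented ceiling (hypothetical helper)."""
    category = CATEGORY.get(ext.lower())
    if category is None:
        return True  # no documented limit for this type
    return size_mb <= LIMITS_MB[category]

print(within_limit("pdf", 600))  # -> False
```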
Security Considerations
- Local Processing: All processing happens locally by default
- Remote Services: Optional (disabled by default) for VLM APIs
- File Cleanup: Temporary files automatically deleted
- URL Validation: Safe URL patterns enforced
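As an illustration, a minimal "safe URL" check could look like this sketch; the actual policy the server enforces is not specified here:

```python
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Allow only plain http(s) URLs with a hostname (illustrative policy)."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.hostname)

print(is_safe_url("https://arxiv.org/pdf/2408.09869"))  # -> True
print(is_safe_url("file:///etc/passwd"))                # -> False
```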
Troubleshooting
Debug Mode
```bash
python -m doc_ingestor_mcp --debug
```
Log Analysis
```bash
tail -f ./logs/server.log
```
Run Test Suite
```bash
python test_server.py
```
Common Issues
"ModuleNotFoundError: No module named 'docling'"
```bash
pip install docling
```
"MLX not available" warnings
- This is normal on non-Apple Silicon Macs
- MLX acceleration is optional; the server falls back to CPU automatically
"Queue is full" errors
- Wait for current jobs to complete
- Increase `max_queue_size` in `config.yaml`
"Download failed" for URLs
- Check internet connection
- Verify URL is accessible
- Some sites may block automated downloads
Memory errors with large files
- Reduce `max_memory_gb` in `config.yaml`
- Try smaller files first
- Use `pipeline: "standard"` instead of `vlm`
OCR not working
- Install tesseract: `brew install tesseract`
- Install easyocr: `pip install easyocr`
- Check language settings in `config.yaml`
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Docling Project Docs