pdf-processor-mcp-server by vladesv - MCP Server

PDF Processing MCP Server

An MCP (Model Context Protocol) server for PDF processing that provides text extraction, table extraction, form field extraction, and metadata retrieval.

Features

Text Extraction: Extract text from PDF files with optional layout preservation
Table Extraction: Extract tables using pdfplumber
Form Processing: Extract form fields and their values
Metadata Retrieval: Get comprehensive PDF metadata and document information
Security: File path validation, size limits, and directory restrictions
Performance: Result caching and asynchronous processing

Installation

Clone the repository:

git clone <repository-url>
cd pdf-processing-mcp

Install dependencies:

pip install -r requirements.txt

Test the installation:

python test_mcp.py

Usage

As MCP Server

The server can be used with Claude Desktop or any MCP-compatible client.

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "pdf-processor": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/path/to/pdf-processing-mcp",
      "env": {
        "PDF_ALLOWED_DIRS": "./documents,./downloads",
        "PDF_CACHE_ENABLED": "true",
        "PDF_MAX_FILE_SIZE": "104857600"
      }
    }
  }
}

Running the Server

python -m src.server

Available Tools

extract_pdf_text
- Extract text from PDF files
- Parameters: file_path, pages (optional), preserve_layout (optional)
extract_pdf_tables
- Extract tables from PDF files
- Parameters: file_path, pages (optional), method (optional)
extract_pdf_forms
- Extract form fields and values
- Parameters: file_path, include_values (optional)
get_pdf_metadata
- Get PDF metadata and document information
- Parameters: file_path
process_pdf_attachment ⭐ NEW
- Process PDF attachments directly from Claude Desktop chat
- Parameters: pdf_data (base64), extraction_type (optional), options (optional)
- Extraction types: text, tables, forms, metadata, all
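
Because process_pdf_attachment takes the PDF as base64 text rather than a file path, a client has to encode the bytes before calling the tool. The following is a minimal sketch of building the arguments, assuming the parameter names listed above. The helper name build_attachment_arguments is hypothetical, and the structure of options is not documented here, so it is passed through unchanged.

```python
import base64
import json

# Extraction types listed for process_pdf_attachment in this README.
EXTRACTION_TYPES = {"text", "tables", "forms", "metadata", "all"}


def build_attachment_arguments(pdf_bytes, extraction_type="all", options=None):
    """Build an arguments dict for process_pdf_attachment (hypothetical helper)."""
    if extraction_type not in EXTRACTION_TYPES:
        raise ValueError(f"unsupported extraction_type: {extraction_type!r}")
    arguments = {
        # The tool expects base64 text, so encode the raw bytes first.
        "pdf_data": base64.b64encode(pdf_bytes).decode("ascii"),
        "extraction_type": extraction_type,
    }
    # options is optional; its keys are an assumption left to the caller.
    if options is not None:
        arguments["options"] = options
    return arguments


if __name__ == "__main__":
    sample = b"%PDF-1.4\n%%EOF\n"  # placeholder bytes, not a complete PDF
    print(json.dumps(build_attachment_arguments(sample, "text"), indent=2))
```

Validating extraction_type on the client side is a design choice in this sketch; the server may perform its own validation.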

Project Structure

pdf-processing-mcp/
├── README.md                  # This file
├── requirements.txt           # Dependencies
├── pyproject.toml            # Python project configuration
├── test_mcp.py               # Test suite
├── .env.example              # Environment configuration template
└── src/                      # Source code
    ├── server.py             # Main MCP server
    ├── config.py             # Configuration management
    ├── exceptions.py         # Custom exceptions
    ├── tools/                # PDF processing tools
    │   ├── text_extractor.py
    │   ├── table_extractor.py
    │   ├── form_extractor.py
    │   ├── metadata_extractor.py
    │   └── attachment_processor.py  # NEW: Process PDF attachments
    └── utils/                # Utility modules
        ├── security.py
        └── caching.py

Configuration

Environment Variables

PDF_ALLOWED_DIRS: Comma-separated list of allowed directories
PDF_MAX_FILE_SIZE: Maximum file size in bytes (default: 100 MB; for example, 104857600 is 100 × 1024 × 1024 bytes)
PDF_CACHE_ENABLED: Enable/disable caching (default: true)
PDF_DEBUG: Enable debug mode (default: false)
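
As an illustration of how these variables might be read, here is a minimal sketch; it is not the contents of src/config.py. The names Settings, load_settings, and parse_bool are hypothetical, the set of accepted truthy strings is an assumption, and the default size is assumed to be 100 × 1024 × 1024 bytes.

```python
import os
from dataclasses import dataclass


def parse_bool(value, default):
    """Interpret an environment string as a boolean (accepted spellings are an assumption)."""
    if value is None:
        return default
    # bool("false") is True in Python, so compare against known strings instead.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    allowed_dirs: tuple
    max_file_size: int
    cache_enabled: bool
    debug: bool


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    raw_dirs = env.get("PDF_ALLOWED_DIRS", "")
    # Split the comma-separated list and drop empty entries.
    allowed_dirs = tuple(d.strip() for d in raw_dirs.split(",") if d.strip())
    return Settings(
        allowed_dirs=allowed_dirs,
        # Assumed default: 100 * 1024 * 1024 bytes.
        max_file_size=int(env.get("PDF_MAX_FILE_SIZE", str(100 * 1024 * 1024))),
        cache_enabled=parse_bool(env.get("PDF_CACHE_ENABLED"), True),
        debug=parse_bool(env.get("PDF_DEBUG"), False),
    )


if __name__ == "__main__":
    print(load_settings())
```

Accepting a plain mapping instead of reading os.environ directly keeps the function easy to test.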

Security

File path validation prevents directory traversal attacks
Size limits prevent processing of overly large files
Directory restrictions limit access to specified paths
File type validation ensures only PDF files are processed
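
The four checks above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/utils/security.py: the names validate_pdf_path and PDFValidationError are hypothetical, and file type validation is reduced here to checking for the %PDF- header at the start of the file.

```python
from pathlib import Path


class PDFValidationError(Exception):
    """Hypothetical exception type used only in this sketch."""


def validate_pdf_path(file_path, allowed_dirs, max_file_size):
    """Return the resolved path if it passes all checks, else raise."""
    # resolve() collapses ".." segments and follows symlinks, so a path like
    # "documents/../secret.pdf" is judged by where it really points.
    resolved = Path(file_path).resolve()
    roots = [Path(d).resolve() for d in allowed_dirs]

    # Directory restriction: the file must sit below one of the allowed roots.
    if not any(root in resolved.parents for root in roots):
        raise PDFValidationError("path is outside the allowed directories")
    if not resolved.is_file():
        raise PDFValidationError("path is not a regular file")

    # Size limit: reject before reading the file contents.
    if resolved.stat().st_size > max_file_size:
        raise PDFValidationError("file exceeds the size limit")

    # File type: a simple magic-bytes check (assumption for this sketch).
    with resolved.open("rb") as handle:
        if handle.read(5) != b"%PDF-":
            raise PDFValidationError("file does not start with a PDF header")
    return resolved
```

Resolving the path before comparing it with the allowed roots is what defeats traversal attempts; comparing the raw strings would not.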

Testing

Run the comprehensive test suite:

python test_mcp.py

All tests should pass, and the output should include "Cache working: True", which confirms that caching is operational.

Dependencies

Core: mcp (Model Context Protocol)
PDF Processing: PyMuPDF, pdfplumber, pandas
Utilities: aiofiles, cachetools
Development: pytest, black, mypy

License

MIT License

Notes

Optimized for Windows compatibility
Camelot and python-magic dependencies removed due to compatibility issues
Uses PyMuPDF for fast text extraction and pdfplumber for table extraction
Comprehensive error handling and logging
Fixed: Cache boolean evaluation bug - caching now works correctly
NEW: Support for processing PDF attachments directly from Claude Desktop chat
Ready for Claude Desktop integration with attachment support
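
The notes above do not describe the cache boolean evaluation bug in detail. One common way such a bug arises in Python is testing the cache object's truthiness: an empty mapping is falsy, so a guard like "if cache:" never lets the first entry in. The sketch below illustrates that pitfall with a plain dict; it is an assumption about the kind of bug involved, not the project's actual code, and the function names are hypothetical.

```python
def cached_call_buggy(cache, key, compute):
    """Illustrates the pitfall: truthiness is used to mean 'caching enabled'."""
    # BUG: an empty dict is falsy, so both guards fail while the cache is
    # empty, nothing is ever stored, and the cache stays empty forever.
    if cache and key in cache:
        return cache[key]
    value = compute()
    if cache:
        cache[key] = value
    return value


def cached_call(cache, key, compute):
    """Corrected version: None means disabled, an empty dict means enabled."""
    if cache is not None and key in cache:
        return cache[key]
    value = compute()
    if cache is not None:
        cache[key] = value
    return value


if __name__ == "__main__":
    calls = []

    def expensive():
        # Stand-in for a real extraction; records how often it runs.
        calls.append(1)
        return "page text"

    cache = {}
    cached_call(cache, "doc.pdf:text", expensive)
    cached_call(cache, "doc.pdf:text", expensive)
    print("Cache working:", len(calls) == 1)
```

The same pitfall applies to any mapping-like cache object that defines a length, since Python treats zero-length containers as false.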