pdf-processor-mcp-server

vladesv/pdf-processor-mcp-server

PDF Processing MCP Server

A comprehensive MCP (Model Context Protocol) server for PDF processing that provides text extraction, table extraction, form processing, and metadata retrieval capabilities.

Features

  • Text Extraction: Extract text from PDF files with optional layout preservation
  • Table Extraction: Extract tables using pdfplumber
  • Form Processing: Extract form fields and their values
  • Metadata Retrieval: Get comprehensive PDF metadata and document information
  • Security: File path validation, size limits, and directory restrictions
  • Performance: Result caching and asynchronous processing

Installation

  1. Clone the repository:
     git clone <repository-url>
     cd pdf-processing-mcp
  2. Install dependencies:
     pip install -r requirements.txt
  3. Test the installation:
     python test_mcp.py

Usage

As MCP Server

The server can be used with Claude Desktop or any MCP-compatible client.

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "pdf-processor": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/path/to/pdf-processing-mcp",
      "env": {
        "PDF_ALLOWED_DIRS": "./documents,./downloads",
        "PDF_CACHE_ENABLED": "true",
        "PDF_MAX_FILE_SIZE": "104857600"
      }
    }
  }
}

Running the Server

python -m src.server

Available Tools

  1. extract_pdf_text

    • Extract text from PDF files
    • Parameters: file_path, pages (optional), preserve_layout (optional)

  2. extract_pdf_tables

    • Extract tables from PDF files
    • Parameters: file_path, pages (optional), method (optional)

  3. extract_pdf_forms

    • Extract form fields and values
    • Parameters: file_path, include_values (optional)

  4. get_pdf_metadata

    • Get PDF metadata and document information
    • Parameters: file_path

  5. process_pdf_attachment ⭐ NEW

    • Process PDF attachments directly from Claude Desktop chat
    • Parameters: pdf_data (base64), extraction_type (optional), options (optional)
    • Extraction types: text, tables, forms, metadata, all
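
To make the parameter lists above concrete, here is a minimal sketch of the request payloads a client might send. MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request; the helper names `build_tool_call` and `attachment_arguments`, the example path, and the assumption that `preserve_layout` is a boolean are illustrative and not taken from the server's code.

```python
import base64
import json

# Extraction types listed for process_pdf_attachment in this README.
EXTRACTION_TYPES = {"text", "tables", "forms", "metadata", "all"}


def build_tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def attachment_arguments(pdf_bytes: bytes, extraction_type: str = "all") -> dict:
    """Build arguments for process_pdf_attachment (hypothetical helper).

    The README states that pdf_data is base64, so the raw bytes are
    encoded here and sent as an ASCII string.
    """
    if extraction_type not in EXTRACTION_TYPES:
        raise ValueError(f"unknown extraction_type: {extraction_type!r}")
    return {
        "pdf_data": base64.b64encode(pdf_bytes).decode("ascii"),
        "extraction_type": extraction_type,
    }


if __name__ == "__main__":
    # Hypothetical file path; preserve_layout is assumed to be a boolean.
    text_request = build_tool_call(
        1,
        "extract_pdf_text",
        {"file_path": "./documents/report.pdf", "preserve_layout": True},
    )
    print(json.dumps(text_request, indent=2))
```

In practice an MCP client library would construct and send these messages; the sketch only shows the shape of the data.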

Project Structure

pdf-processing-mcp/
├── README.md                  # This file
├── requirements.txt           # Dependencies
├── pyproject.toml             # Python project configuration
├── test_mcp.py                # Test suite
├── .env.example               # Environment configuration template
└── src/                       # Source code
    ├── server.py              # Main MCP server
    ├── config.py              # Configuration management
    ├── exceptions.py          # Custom exceptions
    ├── tools/                 # PDF processing tools
    │   ├── text_extractor.py
    │   ├── table_extractor.py
    │   ├── form_extractor.py
    │   ├── metadata_extractor.py
    │   └── attachment_processor.py  # NEW: Process PDF attachments
    └── utils/                 # Utility modules
        ├── security.py
        └── caching.py

Configuration

Environment Variables

  • PDF_ALLOWED_DIRS: Comma-separated list of allowed directories
  • PDF_MAX_FILE_SIZE: Maximum file size in bytes (default: 100 MB; the sample configuration above sets 104857600)
  • PDF_CACHE_ENABLED: Enable/disable caching (default: true)
  • PDF_DEBUG: Enable debug mode (default: false)
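
As an illustration of how these variables could be read, the sketch below parses them into a small settings object. It is not the project's `config.py`: the `PdfConfig` and `load_config` names, the accepted spellings of "true", the empty default for allowed directories, and the choice of 100 * 1024 * 1024 bytes as the default size are all assumptions. Note that `bool("false")` is `True` in Python, so string flags need explicit parsing.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Assumed spellings that count as "enabled"; the real server may differ.
TRUE_VALUES = {"1", "true", "yes", "on"}


def parse_bool(value: Optional[str], default: bool) -> bool:
    """Interpret an environment string as a boolean.

    bool("false") is True because any non-empty string is truthy,
    so the text has to be compared explicitly.
    """
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in TRUE_VALUES


@dataclass(frozen=True)
class PdfConfig:
    """Hypothetical settings container for the variables listed above."""
    allowed_dirs: Tuple[str, ...]
    max_file_size: int
    cache_enabled: bool
    debug: bool


def load_config(env: Mapping[str, str] = os.environ) -> PdfConfig:
    """Read the PDF_* variables from a mapping (os.environ by default)."""
    raw_dirs = env.get("PDF_ALLOWED_DIRS", "")
    dirs = tuple(d.strip() for d in raw_dirs.split(",") if d.strip())
    return PdfConfig(
        allowed_dirs=dirs,
        # Assumed default: 100 * 1024 * 1024 = 104857600 bytes.
        max_file_size=int(env.get("PDF_MAX_FILE_SIZE", str(100 * 1024 * 1024))),
        cache_enabled=parse_bool(env.get("PDF_CACHE_ENABLED"), True),
        debug=parse_bool(env.get("PDF_DEBUG"), False),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.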

Security

  • File path validation prevents directory traversal attacks
  • Size limits prevent processing of overly large files
  • Directory restrictions limit access to specified paths
  • File type validation ensures only PDF files are processed
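
The four checks above can be sketched with the standard library alone. This is an assumed implementation, not the contents of `utils/security.py`: the `validate_pdf_path` function and `PdfSecurityError` exception are hypothetical names, and the file type check here combines the `.pdf` extension with the `%PDF-` header that a well-formed PDF starts with.

```python
from pathlib import Path
from typing import Iterable


class PdfSecurityError(Exception):
    """Raised when a requested file fails a check (hypothetical name)."""


def validate_pdf_path(file_path: str, allowed_dirs: Iterable[str], max_size: int) -> Path:
    """Return the resolved path if it passes all checks, else raise."""
    # resolve() collapses ".." segments and follows symlinks, so a path
    # such as "documents/../secret.pdf" is judged by where it really points.
    resolved = Path(file_path).resolve()
    roots = [Path(d).resolve() for d in allowed_dirs]

    # Directory restriction: the file must sit under an allowed root.
    if not any(root == resolved or root in resolved.parents for root in roots):
        raise PdfSecurityError(f"path outside allowed directories: {resolved}")

    if not resolved.is_file():
        raise PdfSecurityError(f"not a regular file: {resolved}")

    # File type validation, step 1: extension.
    if resolved.suffix.lower() != ".pdf":
        raise PdfSecurityError(f"not a .pdf file: {resolved}")

    # Size limit, checked before any content is read.
    if resolved.stat().st_size > max_size:
        raise PdfSecurityError(f"file exceeds {max_size} bytes: {resolved}")

    # File type validation, step 2: header bytes.
    with resolved.open("rb") as handle:
        if handle.read(5) != b"%PDF-":
            raise PdfSecurityError(f"missing PDF header: {resolved}")

    return resolved
```

Resolving the path before comparing it against the allowed roots is the step that defeats traversal attempts; comparing raw strings would not.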

Testing

Run the comprehensive test suite:

python test_mcp.py

All tests should pass, and the output should include "Cache working: True", which confirms that caching is operational.
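
A check of this kind can be sketched as follows: run the same extraction twice and confirm the second call is served from the cache. The `ResultCache` class is a hypothetical stand-in built on a plain dictionary, not the project's `utils/caching.py` (which, per the dependency list, may rely on cachetools). Keying on the file's modification time and size is an assumed design choice so that an edited file is not served stale results.

```python
import hashlib
from pathlib import Path
from typing import Any, Callable, Dict


class ResultCache:
    """Minimal in-memory cache for extraction results (hypothetical)."""

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def __len__(self) -> int:
        # Because __len__ is defined, an EMPTY cache is falsy. Code that
        # tests `if cache:` would skip a fresh cache; compare against
        # None (`if cache is not None:`) instead.
        return len(self._store)

    @staticmethod
    def make_key(file_path: str, operation: str, **params: Any) -> str:
        """Derive a key from the file identity, operation, and options."""
        path = Path(file_path).resolve()
        stat = path.stat()
        raw = "|".join(
            [str(path), str(stat.st_mtime_ns), str(stat.st_size),
             operation, repr(sorted(params.items()))]
        )
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value, computing and storing it on a miss."""
        if self.enabled and key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        if self.enabled:
            self._store[key] = value
        return value
```

With this sketch, a "cache working" check amounts to asserting that `hits` is 1 after two identical calls.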

Dependencies

  • Core: mcp (Model Context Protocol)
  • PDF Processing: PyMuPDF, pdfplumber, pandas
  • Utilities: aiofiles, cachetools
  • Development: pytest, black, mypy

License

MIT License

Notes

  • Optimized for Windows compatibility
  • Camelot and python-magic dependencies removed due to compatibility issues
  • Uses PyMuPDF for fast text extraction and pdfplumber for table extraction
  • Comprehensive error handling and logging
  • Fixed: Cache boolean evaluation bug - caching now works correctly
  • NEW: Support for processing PDF attachments directly from Claude Desktop chat
  • Ready for Claude Desktop integration with attachment support