📄 MCP PDF

A FastMCP server for PDF processing

46 tools for text extraction, OCR, tables, forms, annotations, and more

Python 3.11+ | FastMCP | License: MIT | PyPI

Works great with MCP Office Tools


What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

Core capabilities:

  • Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
  • Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)
  • OCR for scanned documents via Tesseract
  • Form handling - extract, fill, and create PDF forms
  • Document assembly - merge, split, reorder pages
  • Annotations - sticky notes, highlights, stamps
  • Vector graphics - extract to SVG for schematics and technical drawings

Quick Start

# Run from PyPI (no separate install step)
uvx mcp-pdf

# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf

Development Installation

git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py

Tools

Content Extraction

| Tool | What it does |
| --- | --- |
| extract_text | Pull text from PDF pages with automatic chunking for large files |
| extract_tables | Extract tables to JSON, CSV, or Markdown |
| extract_images | Extract embedded images |
| extract_links | Get all hyperlinks with page filtering |
| pdf_to_markdown | Convert PDF to markdown preserving structure |
| ocr_pdf | OCR scanned documents using Tesseract |
| extract_vector_graphics | Export vector graphics to SVG (schematics, charts, drawings) |

Document Analysis

| Tool | What it does |
| --- | --- |
| extract_metadata | Get title, author, creation date, page count, etc. |
| get_document_structure | Extract table of contents and bookmarks |
| analyze_layout | Detect columns, headers, footers |
| is_scanned_pdf | Check if PDF needs OCR |
| compare_pdfs | Diff two PDFs by text, structure, or metadata |
| analyze_pdf_health | Check for corruption, optimization opportunities |
| analyze_pdf_security | Report encryption, permissions, signatures |

Forms

| Tool | What it does |
| --- | --- |
| extract_form_data | Get form field names and values |
| fill_form_pdf | Fill form fields from JSON |
| create_form_pdf | Create new forms with text fields, checkboxes, dropdowns |
| add_form_fields | Add fields to existing PDFs |

Permit Forms (Coordinate-Based)

Use these tools for scanned PDFs or forms without interactive fields. They draw text at (x, y) coordinates.

| Tool | What it does |
| --- | --- |
| fill_permit_form | Fill any PDF by drawing at coordinates (works with scanned forms) |
| get_field_schema | Get field definitions for validation or UI generation |
| validate_permit_form_data | Check data against field schema before filling |
| preview_field_positions | Generate PDF showing field boundaries (debugging) |
| insert_attachment_pages | Insert image/text pages with "See page X" references |

Requires: pip install mcp-pdf[forms] (adds reportlab dependency)
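
The idea of checking data against a field schema before filling can be sketched as follows. This is only an illustration: the actual schema format returned by get_field_schema is not documented here, so `FieldSpec`, its attributes, and `validate_form_data` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical field definition: a name, a draw position, and simple constraints."""
    name: str
    x: float
    y: float
    required: bool = False
    max_length: Optional[int] = None


def validate_form_data(schema: list[FieldSpec], data: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the data passed every check."""
    problems: list[str] = []
    known = {spec.name for spec in schema}
    for spec in schema:
        value = data.get(spec.name, "")
        if spec.required and not value:
            problems.append(f"missing required field: {spec.name}")
        elif spec.max_length is not None and len(value) > spec.max_length:
            problems.append(f"{spec.name} exceeds {spec.max_length} characters")
    # Flag keys that do not correspond to any field in the schema.
    for name in data:
        if name not in known:
            problems.append(f"unknown field: {name}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller show every issue at once before any text is drawn onto the page.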

Document Assembly

| Tool | What it does |
| --- | --- |
| merge_pdfs | Combine multiple PDFs with bookmark preservation |
| split_pdf_by_pages | Split by page ranges |
| split_pdf_by_bookmarks | Split at chapter/section boundaries |
| reorder_pdf_pages | Rearrange pages in custom order |

Annotations

| Tool | What it does |
| --- | --- |
| add_sticky_notes | Add comment annotations |
| add_highlights | Highlight text regions |
| add_stamps | Add Approved/Draft/Confidential stamps |
| extract_all_annotations | Export annotations to JSON |

How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

  1. PyMuPDF (fastest)
  2. pdfplumber (better for complex layouts)
  3. pypdf (most compatible)

Table extraction:

  1. Camelot (best accuracy, requires Ghostscript)
  2. pdfplumber (no dependencies)
  3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
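
The fallback order described above can be sketched as a loop over extractor callables. This is a minimal illustration, not the server's actual code: `extract_with_fallback` and the two stand-in backends are hypothetical, and a real chain would wrap PyMuPDF, pdfplumber, and pypdf. Treating empty output as a failure is also an assumption.

```python
from typing import Callable, Sequence


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extractor) pair in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any backend failure moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty result")  # assumption: empty text counts as failure
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-in backends (hypothetical) so the sketch runs without any PDF library.
def broken_backend(path: str) -> str:
    raise ValueError("cannot parse file")


def working_backend(path: str) -> str:
    return f"text from {path}"
```

Returning the backend name alongside the text makes it easy to report which library actually produced the result.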


Token Management

Large PDFs can overflow MCP response limits. The server handles this:

  • Automatic chunking splits large documents into page groups
  • Table row limits prevent huge tables from blowing up responses
  • Summary mode returns structure without full content

# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
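
To make the chunking idea concrete, here is a small sketch of expanding a page spec and splitting it into page groups. It is an assumption-laden illustration: only the "1-10" form appears above, so the comma-separated syntax, the `parse_pages` and `chunk_pages` names, and the chunk size are all hypothetical rather than the server's documented behavior.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-3,7" into [1, 2, 3, 7] (1-based page numbers)."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def chunk_pages(pages: list[int], chunk_size: int) -> list[list[int]]:
    """Split a page list into consecutive groups of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]
```

For example, `chunk_pages(parse_pages("1-10"), 4)` yields three groups, which a client could request one at a time to keep each response small.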

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
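
One way such a cache could work is to derive a stable local filename from the URL and download only when that file is missing. The sketch below is a guess at the general approach, not the server's implementation: `cache_path_for`, `fetch_pdf`, the SHA-256 naming scheme, and the 30-second timeout are assumptions.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map an HTTPS URL to a deterministic file path inside cache_dir."""
    if urlparse(url).scheme != "https":
        raise ValueError("only HTTPS URLs are accepted")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.pdf"


def fetch_pdf(url: str, cache_dir: Path) -> Path:
    """Return the cached copy of url, downloading it first if it is not cached yet."""
    target = cache_path_for(url, cache_dir)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        with urlopen(url, timeout=30) as response:
            target.write_bytes(response.read())
    return target
```

Because the path depends only on the URL, a second call for the same document returns the local file without touching the network.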


System Dependencies

Some features require system packages:

| Feature | Dependency |
| --- | --- |
| OCR | tesseract-ocr |
| Camelot tables | ghostscript |
| Tabula tables | default-jre-headless |
| PDF to images | poppler-utils |

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
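
A quick way to see which optional features are usable on a machine is to look for the executables these packages install. The following is a rough sketch, separate from the project's own examples/verify_installation.py: the `missing_dependencies` helper is hypothetical, and the executable names (tesseract, gs, java, pdftoppm) are the usual ones on Ubuntu/Debian but may differ elsewhere.

```python
import shutil

# Feature -> executable name expected on PATH (assumed names, see note above).
REQUIRED_BINARIES = {
    "OCR": "tesseract",
    "Camelot tables": "gs",
    "Tabula tables": "java",
    "PDF to images": "pdftoppm",
}


def missing_dependencies(binaries: dict[str, str] = REQUIRED_BINARIES) -> dict[str, str]:
    """Return the subset of features whose executable is not found on PATH."""
    return {
        feature: exe
        for feature, exe in binaries.items()
        if shutil.which(exe) is None
    }
```

Calling `missing_dependencies()` with no arguments returns an empty dict when every executable is found, or a feature-to-executable mapping of what still needs installing.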

Configuration

Optional environment variables:

| Variable | Purpose |
| --- | --- |
| MCP_PDF_ALLOWED_PATHS | Colon-separated directories for file output |
| PDF_TEMP_DIR | Temp directory for processing (default: /tmp/mcp-pdf-processing) |
| TESSDATA_PREFIX | Tesseract language data location |
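
As an illustration of how a colon-separated allow-list like MCP_PDF_ALLOWED_PATHS might be applied, the sketch below parses the variable and checks whether an output path falls inside one of the listed directories. The helper names are hypothetical, and the behavior when the variable is unset (nothing allowed here) is an assumption, not the server's documented policy.

```python
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional


def allowed_dirs(env: Optional[Mapping[str, str]] = None) -> list[Path]:
    """Parse MCP_PDF_ALLOWED_PATHS into a list of resolved directories."""
    source = os.environ if env is None else env
    raw = source.get("MCP_PDF_ALLOWED_PATHS", "")
    return [Path(entry).resolve() for entry in raw.split(":") if entry]


def is_output_allowed(target: str, dirs: list[Path]) -> bool:
    """True if target, after resolving '..' and symlinks, lies inside an allowed directory."""
    resolved = Path(target).resolve()
    return any(resolved.is_relative_to(directory) for directory in dirs)
```

Resolving the path before comparing it matters: a target such as /tmp/out/../etc/passwd would otherwise look like it sits under /tmp/out.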

Development

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/

License

MIT