nutrient-pdf-mcp-server

Nutrient PDF MCP Server

A powerful Model Context Protocol server for LLM-driven PDF document analysis and exploration

A Model Context Protocol (MCP) server for investigating PDF object trees with lazy loading support. This tool allows LLMs to efficiently explore PDF document structure without overwhelming token limits.

Features

  • Lazy Loading: Explore PDF structure without loading entire object trees
  • Path Navigation: Navigate through PDF objects using dot notation (e.g., Pages.Kids.0)
  • Selective Resolution: Resolve specific indirect objects on demand
  • Token Efficient: Responses are far smaller than full tree dumps (see Token Efficiency below)
  • Type Safe: Comprehensive type hints and error handling

Installation

Quick Start

git clone https://github.com/PSPDFKit/nutrient-pdf-mcp-server.git
cd nutrient-pdf-mcp-server
make install-dev  # Sets up development environment

For Claude Code CLI

Recommended: Build and Install

pip install build
make build
pipx install dist/nutrient_pdf_mcp-1.0.0-py3-none-any.whl
claude mcp add nutrient-pdf-mcp nutrient-pdf-mcp

Development Mode

make install-dev
claude mcp add nutrient-pdf-mcp "$(pwd)/venv/bin/python" -m pdf_mcp.server

Manual Configuration

{
  "mcpServers": {
    "nutrient-pdf-mcp": {
      "command": "python",
      "args": ["-m", "pdf_mcp.server"]
    }
  }
}
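
The bare "python" command above resolves to whichever interpreter comes first on PATH, which may not be the one with the server installed. As a minimal sketch, the same configuration can be generated with an explicit interpreter path. The build_mcp_config helper is hypothetical, and the sketch assumes pdf_mcp is installed in the interpreter that runs the script:

```python
import json
import sys
from typing import Any, Dict


def build_mcp_config(python_executable: str) -> Dict[str, Any]:
    """Return the mcpServers entry shown above, with an explicit interpreter."""
    return {
        "mcpServers": {
            "nutrient-pdf-mcp": {
                "command": python_executable,
                "args": ["-m", "pdf_mcp.server"],
            }
        }
    }


# Assumption: the interpreter running this script has pdf_mcp installed.
config = build_mcp_config(sys.executable)
print(json.dumps(config, indent=2))
```

Run it from the project's virtual environment and paste the output into the client's configuration file.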

Available Tools

get_pdf_object_tree

Nutrient PDF MCP Server - Get JSON representation of PDF object tree with lazy loading.

Parameters:

  • pdf_path (required): Path to the PDF file
  • object_id (optional): Specific object ID to retrieve (e.g., '1 0')
  • path (optional): Object path to navigate (e.g., 'Pages.Kids.0')
  • mode (optional): Parsing mode - 'lazy' (default) or 'full'

Examples:

{
  "pdf_path": "document.pdf",
  "mode": "lazy"
}

{
  "pdf_path": "document.pdf",
  "path": "Pages.Kids.0",
  "mode": "lazy"
}

resolve_indirect_object

Nutrient PDF MCP Server - Resolve a specific indirect object by its object and generation numbers.

Parameters:

  • pdf_path (required): Path to the PDF file
  • objnum (required): PDF object number (e.g., 3)
  • gennum (optional): PDF generation number (defaults to 0)
  • depth (optional): Resolution depth - 'shallow' (default) or 'deep'

Examples:

{
  "pdf_path": "document.pdf",
  "objnum": 3,
  "gennum": 0,
  "depth": "shallow"
}
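
A client often has an object ID string such as '3 0' (the format used by object_id in get_pdf_object_tree) and needs the separate objnum and gennum values expected here. The sketch below shows one way to bridge the two and apply the documented defaults. Both helpers are hypothetical, and the whitespace-separated ID format is an assumption based on the '1 0' example:

```python
from typing import Any, Dict, Tuple

VALID_DEPTHS = ("shallow", "deep")


def split_object_id(object_id: str) -> Tuple[int, int]:
    """Split an ID such as '3 0' into (objnum, gennum)."""
    objnum, gennum = object_id.split()
    return int(objnum), int(gennum)


def build_resolve_args(pdf_path: str, objnum: int, gennum: int = 0,
                       depth: str = "shallow") -> Dict[str, Any]:
    """Assemble resolve_indirect_object arguments with the documented defaults."""
    if depth not in VALID_DEPTHS:
        raise ValueError(f"depth must be one of {VALID_DEPTHS}, got {depth!r}")
    return {"pdf_path": pdf_path, "objnum": objnum,
            "gennum": gennum, "depth": depth}


objnum, gennum = split_object_id("3 0")
args = build_resolve_args("document.pdf", objnum, gennum)
```

Validating depth on the client side avoids a round trip to the server for a value that can never succeed.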

Command Line Usage

# Run the server
make serve

# Or run with debug logging
make serve-debug

Architecture

Core Components

  • parser.py: Main PDF parsing logic with lazy loading support
  • server.py: MCP server implementation
  • types.py: Type definitions for PDF objects and responses
  • exceptions.py: Custom exception classes

Response Types

All PDF objects are serialized into a consistent JSON format:

{
  "type": "dict",
  "value": {
    "/Type": {"type": "name", "value": "/Pages"},
    "/Kids": {
      "type": "array", 
      "value": [
        {"type": "indirect_ref", "objnum": 2, "gennum": 0}
      ]
    }
  }
}
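
Because every node carries a "type" field, a client can walk a response generically. The following sketch collects the indirect references in a response so they can be resolved later. The collect_indirect_refs helper is hypothetical and assumes only the three node shapes shown above (dict, array, indirect_ref), treating anything else as a leaf:

```python
from typing import Any, Dict, List, Tuple


def collect_indirect_refs(node: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Return (objnum, gennum) for every indirect_ref at or below node."""
    kind = node.get("type")
    if kind == "indirect_ref":
        return [(node["objnum"], node["gennum"])]
    refs: List[Tuple[int, int]] = []
    if kind == "dict":
        for child in node["value"].values():
            refs.extend(collect_indirect_refs(child))
    elif kind == "array":
        for child in node["value"]:
            refs.extend(collect_indirect_refs(child))
    # Other node types (e.g. "name") are leaves and contribute nothing.
    return refs


# The example response from above.
response = {
    "type": "dict",
    "value": {
        "/Type": {"type": "name", "value": "/Pages"},
        "/Kids": {
            "type": "array",
            "value": [{"type": "indirect_ref", "objnum": 2, "gennum": 0}],
        },
    },
}

refs = collect_indirect_refs(response)
```

Each pair in refs can be passed as objnum and gennum to resolve_indirect_object.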

Token Efficiency

Lazy loading keeps responses small. Approximate response sizes by mode:

  • Lazy mode: ~5-50 lines (minimal tokens)
  • Shallow resolution: ~50-100 lines (reasonable tokens)
  • Deep resolution: 500+ lines (use sparingly)
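
To check these figures against a real document, a client can count the lines of a response. This is a rough sketch that assumes the line counts above refer to pretty-printed JSON with two-space indentation. The response_line_count helper and both sample nodes are hypothetical:

```python
import json
from typing import Any


def response_line_count(response: Any) -> int:
    """Count the lines of a response when pretty-printed as JSON."""
    return len(json.dumps(response, indent=2).splitlines())


# An unresolved reference, as a lazy response would return it.
lazy_node = {"type": "indirect_ref", "objnum": 2, "gennum": 0}

# The same reference nested inside an array node.
nested_node = {"type": "array", "value": [lazy_node]}

lazy_lines = response_line_count(lazy_node)
nested_lines = response_line_count(nested_node)
```

Line count is only a proxy for token usage, but it is cheap to compute and tracks response size closely enough to compare modes.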

Examples

Exploring PDF Structure

  1. Get overview: get_pdf_object_tree(pdf_path="document.pdf", mode="lazy")
  2. Navigate to pages: get_pdf_object_tree(pdf_path="document.pdf", path="Pages", mode="lazy")
  3. Resolve specific page: resolve_indirect_object(pdf_path="document.pdf", objnum=3, gennum=0, depth="shallow")
  4. Deep dive when needed: resolve_indirect_object(pdf_path="document.pdf", objnum=3, gennum=0, depth="deep")
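
On the wire, MCP clients issue tool calls as JSON-RPC 2.0 requests with the method "tools/call". The sketch below builds the four steps above as such requests, using the parameter names from the Available Tools section. The tool_call helper and the request IDs are illustrative, and an MCP client library would normally construct these messages for you:

```python
import json
from typing import Any, Dict


def tool_call(request_id: int, name: str,
              arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 tools/call request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


pdf = "document.pdf"
steps = [
    # 1. Overview of the document structure.
    tool_call(1, "get_pdf_object_tree", {"pdf_path": pdf, "mode": "lazy"}),
    # 2. Navigate to the page tree.
    tool_call(2, "get_pdf_object_tree",
              {"pdf_path": pdf, "path": "Pages", "mode": "lazy"}),
    # 3. Resolve one page without following its references.
    tool_call(3, "resolve_indirect_object",
              {"pdf_path": pdf, "objnum": 3, "gennum": 0, "depth": "shallow"}),
    # 4. Resolve the same object fully, only when needed.
    tool_call(4, "resolve_indirect_object",
              {"pdf_path": pdf, "objnum": 3, "gennum": 0, "depth": "deep"}),
]

wire_messages = [json.dumps(step) for step in steps]
```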

Path Navigation Examples

  • "Pages" - Navigate to Pages object
  • "Pages.Kids" - Get Kids array from Pages
  • "Pages.Kids.0" - Get first page
  • "Pages.Kids.0.MediaBox.2" - Get width from MediaBox array

Development

Quick Start

# Set up development environment
make install-dev

# Run all quality checks (format, lint, typecheck, test)
make quality

# Or run individual commands
make test          # Run tests
make format        # Format code
make lint          # Run linter
make typecheck     # Type checking

Project Structure

nutrient-pdf-mcp-server/
├── pdf_mcp/
│   ├── __init__.py
│   ├── server.py          # MCP server
│   ├── parser.py          # PDF parsing logic
│   ├── types.py           # Type definitions
│   └── exceptions.py      # Custom exceptions
├── tests/                 # Test suite
├── res/                   # Sample PDFs
├── pyproject.toml         # Project configuration
└── README.md

Publishing to PyPI

# Build the package
make build

# Upload to test PyPI first
twine upload --repository testpypi dist/*

# Upload to production PyPI
twine upload dist/*

After publishing, users can install with:

pipx install nutrient-pdf-mcp
# or
pip install --user nutrient-pdf-mcp

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Ensure code quality checks pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.
