labeveryday/mcp_pdf_reader
The MCP PDF Reader Server is a robust server built using FastMCP, designed to handle comprehensive PDF processing tasks such as text and image extraction, and OCR for reading text within images.
MCP PDF Reader Server (Python + FastMCP)
A powerful Model Context Protocol (MCP) server built with FastMCP that provides comprehensive PDF processing capabilities including text extraction, image extraction, and OCR for reading text within images.
Features
- Text Extraction: Extract text content from PDF pages
- Image Extraction: Extract all images from PDF files
- OCR Capabilities: Read text from images using Tesseract OCR
- Comprehensive Analysis: Get detailed PDF structure and metadata
- Page Range Support: Process specific page ranges
- Multiple Languages: OCR support for multiple languages
Prerequisites
System Dependencies
Tesseract OCR
You need to install Tesseract OCR on your system:
Ubuntu/Debian:
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng
macOS:
brew install tesseract
Windows:
- Download from: https://github.com/UB-Mannheim/tesseract/wiki
- Install and add to PATH
- Or use:
conda install -c conda-forge tesseract
Additional Language Packs (Optional)
# Ubuntu/Debian: French, German, and Spanish language data
sudo apt install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa
Installation
Quick Start with UV
- Install UV (if not already installed):
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
- Clone/Create the project:
mkdir mcp-pdf-reader-server
cd mcp-pdf-reader-server
- Initialize and install with UV:
# Copy the files (pdf_reader_server.py and pyproject.toml)
# Then install dependencies
uv sync
- Verify installation:
uv run python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
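If the one-liner above fails, it can help to check separately whether the Tesseract binary is reachable at all. The following is a minimal sketch using only the standard library; `find_tesseract` and `tesseract_version` are hypothetical helper names, not part of the server.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the absolute path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Run `<binary> --version` and return the first line of its output."""
    result = subprocess.run(
        [binary_path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""
```

A `None` result from `find_tesseract()` usually means the install step was skipped or the install directory is missing from PATH.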
Alternative: Manual Setup
If you prefer traditional setup:
- Create virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install fastmcp PyMuPDF pytesseract Pillow
Usage
Running the Server
With UV:
uv run python pdf_reader_server.py
Or if you have the environment activated:
python pdf_reader_server.py
The server will start and listen for MCP requests on stdin/stdout.
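MCP clients normally handle the wire format for you, but for orientation, here is a rough sketch of what a single tool call looks like as a JSON-RPC 2.0 message on the stdio transport. `build_tool_call` is a hypothetical helper; a real client must also complete the MCP initialization handshake before calling tools, which this sketch omits.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build one newline-terminated JSON-RPC 2.0 `tools/call` message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the message stays on one line.
    return json.dumps(message) + "\n"


line = build_tool_call(1, "read_pdf_text", {"file_path": "/path/to/document.pdf"})
```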
Available Tools
1. read_pdf_text
Extract text content from PDF pages.
Parameters:
- file_path (string, required): Path to the PDF file
- page_range (object, optional): Dict with start and end page numbers
Example:
{
  "file_path": "/path/to/document.pdf",
  "page_range": {"start": 1, "end": 5}
}
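To illustrate how a `page_range` like the one above might map onto page indices, here is a minimal sketch. It assumes page numbers are 1-based and inclusive (consistent with the example responses, but not confirmed by the source), and `resolve_page_range` is a hypothetical helper rather than the server's actual code.

```python
def resolve_page_range(page_range, total_pages):
    """Convert an optional {"start", "end"} dict into zero-based page indices.

    Assumes 1-based, inclusive page numbers; the server's real rules may differ.
    """
    if page_range is None:
        return range(total_pages)
    start = page_range.get("start", 1)
    end = page_range.get("end", total_pages)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: start={start}, end={end}")
    # Clamp the end so a range running past the last page is truncated, not rejected.
    end = min(end, total_pages)
    return range(start - 1, end)
```

Under these assumptions, `{"start": 1, "end": 5}` on a 10-page document selects indices 0 through 4.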
2. extract_pdf_images
Extract all images from a PDF file.
Parameters:
- file_path (string, required): Path to the PDF file
- output_dir (string, optional): Directory to save images
- page_range (object, optional): Page range to process
Example:
{
  "file_path": "/path/to/document.pdf",
  "output_dir": "/path/to/images/",
  "page_range": {"start": 1, "end": 3}
}
3. read_pdf_with_ocr
Extract text from both regular text and images using OCR.
Parameters:
- file_path (string, required): Path to the PDF file
- page_range (object, optional): Page range to process
- ocr_language (string, optional): OCR language code (default: "eng")
Example:
{
  "file_path": "/path/to/document.pdf",
  "ocr_language": "eng+fra",
  "page_range": {"start": 1, "end": 10}
}
Supported OCR Languages:
- eng: English
- fra: French
- deu: German
- spa: Spanish
- eng+fra: Multiple languages (codes joined with +)
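Because a combined code such as `eng+fra` only works when every listed language pack is installed, a client may want to check the string before sending it. The sketch below shows one way to do that; both function names are hypothetical, and the list of installed codes is assumed to come from elsewhere (for example `tesseract --list-langs`).

```python
def split_ocr_languages(ocr_language):
    """Split a Tesseract language string such as "eng+fra" into its codes."""
    codes = [code.strip() for code in ocr_language.split("+")]
    if not all(codes):
        raise ValueError(f"empty language code in {ocr_language!r}")
    return codes


def missing_languages(ocr_language, installed):
    """Return the requested codes that are absent from `installed`."""
    installed = set(installed)
    return [code for code in split_ocr_languages(ocr_language) if code not in installed]
```

For instance, `missing_languages("eng+fra", ["eng"])` reports `fra` as the pack still to be installed.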
4. get_pdf_info
Get comprehensive metadata and statistics about a PDF.
Parameters:
- file_path (string, required): Path to the PDF file
5. analyze_pdf_structure
Analyze the structure and content distribution of a PDF.
Parameters:
- file_path (string, required): Path to the PDF file
Configuration with Claude Desktop
With UV
Add this to your claude_desktop_config.json:
{
  "mcpServers": {
    "pdf-reader": {
      "command": "uv",
      "args": ["run", "python", "/path/to/your/pdf_reader_server.py"],
      "cwd": "/path/to/your/mcp-pdf-reader-server"
    }
  }
}
With Virtual Environment
{
  "mcpServers": {
    "pdf-reader": {
      "command": "/path/to/your/.venv/bin/python",
      "args": ["/path/to/your/pdf_reader_server.py"]
    }
  }
}
System Python
{
  "mcpServers": {
    "pdf-reader": {
      "command": "python",
      "args": ["/path/to/your/pdf_reader_server.py"],
      "env": {
        "PYTHONPATH": "/path/to/your/.venv/lib/python3.x/site-packages"
      }
    }
  }
}
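If claude_desktop_config.json already lists other servers, the new entry has to be merged in rather than pasted over the file. A minimal sketch of that merge follows; `add_mcp_server` is a hypothetical helper and `other-server` is a made-up existing entry used only for illustration.

```python
import copy
import json


def add_mcp_server(config, name, command, args, cwd=None):
    """Return a copy of `config` with one entry added under "mcpServers"."""
    updated = copy.deepcopy(config)
    entry = {"command": command, "args": list(args)}
    if cwd is not None:
        entry["cwd"] = cwd
    # setdefault keeps any servers that are already configured.
    updated.setdefault("mcpServers", {})[name] = entry
    return updated


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
merged = add_mcp_server(
    existing,
    "pdf-reader",
    "uv",
    ["run", "python", "/path/to/your/pdf_reader_server.py"],
    cwd="/path/to/your/mcp-pdf-reader-server",
)
print(json.dumps(merged, indent=2))
```

Working on a deep copy leaves the original dictionary untouched, which makes it easy to compare before and after or to back out of the change.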
Example Responses
Text Extraction Response
{
  "success": true,
  "file_path": "/path/to/document.pdf",
  "pages_processed": "1-3",
  "total_pages": 10,
  "pages_text": [
    {
      "page_number": 1,
      "text": "Page 1 content...",
      "word_count": 125
    }
  ],
  "combined_text": "All text combined...",
  "total_word_count": 1250,
  "total_character_count": 8750
}
OCR Response
{
  "success": true,
  "file_path": "/path/to/document.pdf",
  "pages_processed": "1-2",
  "ocr_language": "eng",
  "pages_data": [
    {
      "page_number": 1,
      "text": "Regular text from PDF...",
      "ocr_text": "Text extracted from images...",
      "images_with_text": [
        {
          "image_index": 1,
          "ocr_text": "Text from image 1",
          "confidence": "high"
        }
      ],
      "combined_text": "Combined text and OCR...",
      "text_word_count": 100,
      "ocr_word_count": 25
    }
  ],
  "summary": {
    "total_text_word_count": 200,
    "total_ocr_word_count": 50,
    "combined_word_count": 250,
    "images_processed": 3
  },
  "all_text_combined": "All extracted text..."
}
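A client consuming an OCR response shaped like the example above might pull out the per-page text as follows. This is only a sketch: the field names are taken from the example, the helpers are hypothetical, and it assumes a failed call reports `"success": false`, which the source does not spell out.

```python
def collect_page_text(response):
    """Map page_number -> combined_text for an OCR response shaped like the example."""
    if not response.get("success"):
        raise RuntimeError("tool call did not succeed")
    return {
        page["page_number"]: page.get("combined_text", "")
        for page in response.get("pages_data", [])
    }


def count_images_with_text(response):
    """Count the images, across all pages, for which OCR text was reported."""
    return sum(
        len(page.get("images_with_text", []))
        for page in response.get("pages_data", [])
    )
```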
Performance Considerations
OCR Performance
- OCR processing can be slow for large images
- Consider processing smaller page ranges for faster results
- Images smaller than 50x50 pixels are automatically skipped
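The size cutoff mentioned above can be pictured as a simple dimension check. The sketch below is an assumption about how such a filter could look, not the server's code: it treats an image as too small when either side is under 50 pixels, and `is_large_enough` is a hypothetical name.

```python
MIN_SIDE = 50  # pixels; taken from the 50x50 figure above


def is_large_enough(width, height, min_side=MIN_SIDE):
    """Return True when both dimensions reach min_side.

    Assumption: an image is skipped if either side falls below the threshold.
    """
    return width >= min_side and height >= min_side


# Hypothetical (width, height) pairs for images found on a page.
sizes = [(640, 480), (32, 32), (49, 200), (50, 50)]
kept = [size for size in sizes if is_large_enough(*size)]
```

Filtering like this avoids spending OCR time on icons, bullets, and other decorative images that rarely contain readable text.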
Memory Usage
- Large PDFs with many images may consume significant memory
- The server processes pages sequentially to manage memory usage
- Extracted images are saved to disk to reduce memory pressure
Optimization Tips
- Use page ranges for large documents
- Specify output directories for image extraction to avoid temp file buildup
- Choose appropriate OCR languages to improve accuracy and speed
- Preprocess images if OCR quality is poor (consider adding OpenCV)
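One way to apply the page-range tip is to split a large document into fixed-size batches and issue one tool call per batch. The generator below is a sketch under the assumption that page numbers are 1-based and inclusive; `page_range_batches` is a hypothetical helper.

```python
def page_range_batches(total_pages, batch_size):
    """Yield {"start", "end"} dicts (1-based, inclusive) covering every page."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        yield {"start": start, "end": min(start + batch_size - 1, total_pages)}


batches = list(page_range_batches(10, 4))
```

Each yielded dict can be passed directly as the `page_range` argument of a tool call, and the total page count can be obtained first from `get_pdf_info`.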
Troubleshooting
Common Issues
- Tesseract not found:
  TesseractNotFoundError: tesseract is not installed
  - Install the Tesseract OCR system package
  - Ensure it's in your PATH
- Permission errors:
  - Ensure the Python process has read access to PDF files
  - Ensure write access to output directories
- Poor OCR results:
  - Try different OCR language codes
  - Consider image preprocessing
  - Check if images are high enough resolution
- Memory errors:
  - Process smaller page ranges
  - Close other applications
  - Consider increasing available RAM
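Permission problems in particular can be checked before calling the server at all. The following sketch uses only the standard library; `preflight_problems` is a hypothetical helper, and it should be run as the same user that runs the server process for the result to be meaningful.

```python
import os


def preflight_problems(pdf_path, output_dir=None):
    """Return a list of human-readable access problems (empty when none are found)."""
    problems = []
    if not os.path.isfile(pdf_path):
        problems.append(f"not a file: {pdf_path}")
    elif not os.access(pdf_path, os.R_OK):
        problems.append(f"no read access: {pdf_path}")
    if output_dir is not None:
        if not os.path.isdir(output_dir):
            problems.append(f"not a directory: {output_dir}")
        elif not os.access(output_dir, os.W_OK):
            problems.append(f"no write access: {output_dir}")
    return problems
```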
Debug Mode
Run with unbuffered output using UV, so that log messages appear as soon as they are written:
PYTHONUNBUFFERED=1 uv run python pdf_reader_server.py
Or with regular Python:
PYTHONUNBUFFERED=1 python pdf_reader_server.py
Testing OCR
Test Tesseract directly:
tesseract --list-langs
tesseract image.png output  # writes output.txt; Tesseract appends the .txt extension itself
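The language listing can also be read from a script, for example to confirm that a language pack is present before requesting it. The sketch below parses text in the shape `tesseract --list-langs` typically prints (a header line followed by one code per line); the sample text and its path are illustrative, and `parse_list_langs` is a hypothetical helper.

```python
def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header, which starts with "List of available languages".
        if not line or line.lower().startswith("list of"):
            continue
        codes.append(line)
    return codes


# Illustrative output only; the directory shown is a made-up example.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nfra\nosd\n'
installed = parse_list_langs(sample)
```

In practice the text would come from running the command with `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)` and reading stdout, falling back to stderr if stdout is empty.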
Dependencies
- fastmcp: Modern MCP server framework
- PyMuPDF: Fast PDF processing and rendering
- pytesseract: Python wrapper for Tesseract OCR
- Pillow: Image processing library
- tesseract-ocr: System OCR engine
Advanced Features
Custom OCR Configuration
You can modify the OCR configuration in the code:
ocr_text = pytesseract.image_to_string(
    pil_image,
    lang=ocr_language,
    config='--psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz '
)
Image Preprocessing
For better OCR results, consider adding image preprocessing:
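# Add to requirements: opencv-python, numpy
import cv2
import numpy as np
from PIL import Image

# Preprocessing example: grayscale conversion followed by Otsu binarization
def preprocess_image(image):
    # Normalize to RGB first so RGBA, palette, or grayscale inputs do not break the conversion
    gray = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 passed here is ignored
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    return Image.fromarray(thresh)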
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
MIT License - see LICENSE file for details.