vladesv/pdf-processor-mcp-server
PDF Processing MCP Server
A comprehensive MCP (Model Context Protocol) server for PDF processing that provides text extraction, table extraction, form processing, and metadata retrieval capabilities.
Features
- Text Extraction: Extract text from PDF files with optional layout preservation
- Table Extraction: Extract tables using pdfplumber
- Form Processing: Extract form fields and their values
- Metadata Retrieval: Get comprehensive PDF metadata and document information
- Security: File path validation, size limits, and directory restrictions
- Performance: Result caching and async processing
Installation
- Clone the repository:
git clone <repository-url>
cd pdf-processing-mcp
- Install dependencies:
pip install -r requirements.txt
- Test the installation:
python test_mcp.py
Usage
As MCP Server
The server can be used with Claude Desktop or any MCP-compatible client.
Claude Desktop Configuration
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "pdf-processor": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/path/to/pdf-processing-mcp",
      "env": {
        "PDF_ALLOWED_DIRS": "./documents,./downloads",
        "PDF_CACHE_ENABLED": "true",
        "PDF_MAX_FILE_SIZE": "104857600"
      }
    }
  }
}
Running the Server
python -m src.server
Available Tools
- extract_pdf_text
  - Extract text from PDF files
  - Parameters: file_path, pages (optional), preserve_layout (optional)
- extract_pdf_tables
  - Extract tables from PDF files
  - Parameters: file_path, pages (optional), method (optional)
- extract_pdf_forms
  - Extract form fields and values
  - Parameters: file_path, include_values (optional)
- get_pdf_metadata
  - Get PDF metadata and document information
  - Parameters: file_path
- process_pdf_attachment (NEW)
  - Process PDF attachments directly from Claude Desktop chat
  - Parameters: pdf_data (base64), extraction_type (optional), options (optional)
  - Extraction types: text, tables, forms, metadata, all
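For clients that build requests by hand, the following sketch shows one way a PDF could be base64-encoded and wrapped in an MCP `tools/call` request for process_pdf_attachment. The tool name, parameter names, and extraction types come from the list above. The helper names `encode_pdf` and `build_attachment_request` are hypothetical, and the shape of `options` is not documented here, so it is left out. Whether the server expects plain base64 or some other framing is also an assumption.

```python
import base64
import json

# Extraction types as listed for process_pdf_attachment.
EXTRACTION_TYPES = {"text", "tables", "forms", "metadata", "all"}


def encode_pdf(pdf_bytes: bytes) -> str:
    """Return raw PDF bytes as plain base64 text (assumed wire format)."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def build_attachment_request(pdf_bytes: bytes, extraction_type: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` message (hypothetical helper)."""
    if extraction_type not in EXTRACTION_TYPES:
        raise ValueError(f"unknown extraction_type: {extraction_type!r}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "process_pdf_attachment",
            "arguments": {
                "pdf_data": encode_pdf(pdf_bytes),
                "extraction_type": extraction_type,
            },
        },
    }


if __name__ == "__main__":
    sample = b"%PDF-1.4\n%%EOF\n"  # placeholder bytes, not a usable PDF
    print(json.dumps(build_attachment_request(sample, "metadata"), indent=2))
```

In practice an MCP client library would normally construct this message for you; the sketch only makes the argument layout explicit.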
Project Structure
pdf-processing-mcp/
├── README.md                # This file
├── requirements.txt         # Dependencies
├── pyproject.toml           # Python project configuration
├── test_mcp.py              # Test suite
├── .env.example             # Environment configuration template
└── src/                     # Source code
    ├── server.py            # Main MCP server
    ├── config.py            # Configuration management
    ├── exceptions.py        # Custom exceptions
    ├── tools/               # PDF processing tools
    │   ├── text_extractor.py
    │   ├── table_extractor.py
    │   ├── form_extractor.py
    │   ├── metadata_extractor.py
    │   └── attachment_processor.py  # NEW: Process PDF attachments
    └── utils/               # Utility modules
        ├── security.py
        └── caching.py
Configuration
Environment Variables
- PDF_ALLOWED_DIRS: Comma-separated list of allowed directories
- PDF_MAX_FILE_SIZE: Maximum file size in bytes (default: 100MB)
- PDF_CACHE_ENABLED: Enable/disable caching (default: true)
- PDF_DEBUG: Enable debug mode (default: false)
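A minimal sketch of how these variables might be read into a settings object. It is not taken from src/config.py: the names `PdfSettings`, `load_settings`, and `_parse_bool` are hypothetical, and treating "1", "yes", and "on" as true is an assumption. The explicit boolean parser matters because `bool("false")` is `True` in Python.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import List, Mapping, Optional

DEFAULT_MAX_FILE_SIZE = 100 * 1024 * 1024  # 104857600 bytes, as in the config example


def _parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PdfSettings:
    allowed_dirs: List[Path]
    max_file_size: int
    cache_enabled: bool
    debug: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> PdfSettings:
    """Read the PDF_* variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_dirs = env.get("PDF_ALLOWED_DIRS", "")
    # Split on commas, ignore empty entries, and normalise each directory.
    allowed = [Path(p.strip()).resolve() for p in raw_dirs.split(",") if p.strip()]
    return PdfSettings(
        allowed_dirs=allowed,
        max_file_size=int(env.get("PDF_MAX_FILE_SIZE", DEFAULT_MAX_FILE_SIZE)),
        cache_enabled=_parse_bool(env.get("PDF_CACHE_ENABLED", "true")),
        debug=_parse_bool(env.get("PDF_DEBUG", "false")),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to test.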
Security
- File path validation prevents directory traversal attacks
- Size limits prevent processing of overly large files
- Directory restrictions limit access to specified paths
- File type validation ensures only PDF files are processed
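The four checks above could be combined roughly as follows. This is an illustrative sketch, not the contents of src/utils/security.py: `validate_pdf_path` and `PdfValidationError` are hypothetical names, and checking only for the `%PDF-` header at byte 0 is a simplification of real file type detection.

```python
from pathlib import Path
from typing import Iterable

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


class PdfValidationError(Exception):
    """Raised when a path fails one of the safety checks (hypothetical)."""


def validate_pdf_path(file_path: str, allowed_dirs: Iterable[Path], max_file_size: int) -> Path:
    """Return the resolved path if it passes all checks, else raise."""
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment test below sees the real location of the file.
    resolved = Path(file_path).resolve()
    roots = [Path(d).resolve() for d in allowed_dirs]
    if not any(root == resolved or root in resolved.parents for root in roots):
        raise PdfValidationError("path is outside the allowed directories")
    if not resolved.is_file():
        raise PdfValidationError("path is not a regular file")
    if resolved.stat().st_size > max_file_size:
        raise PdfValidationError("file exceeds the size limit")
    with resolved.open("rb") as handle:
        if handle.read(len(PDF_MAGIC)) != PDF_MAGIC:
            raise PdfValidationError("file does not start with a PDF header")
    return resolved
```

Comparing resolved paths, rather than checking the raw string for "..", is what makes the directory restriction hold against traversal attempts and symlinks.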
Testing
Run the comprehensive test suite:
python test_mcp.py
All tests should pass, and the output should include "Cache working: True", confirming that caching is operational.
Dependencies
- Core: mcp (Model Context Protocol)
- PDF Processing: PyMuPDF, pdfplumber, pandas
- Utilities: aiofiles, cachetools
- Development: pytest, black, mypy
License
MIT License
Notes
- Optimized for Windows compatibility
- Camelot and python-magic dependencies removed due to compatibility issues
- Uses PyMuPDF for fast text extraction and pdfplumber for table extraction
- Comprehensive error handling and logging
- Fixed: Cache boolean evaluation bug - caching now works correctly
- NEW: Support for processing PDF attachments directly from Claude Desktop chat
- Ready for Claude Desktop integration with attachment support
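The notes do not say what the cache boolean evaluation bug was. One common Python pitfall of that kind, shown here purely as an illustration and not as a description of this project's code, is testing a cache object for truthiness: any object that defines `__len__` is falsy while empty, so `if cache:` bypasses a valid but empty cache and nothing is ever stored in it. `ResultCache`, `extract_buggy`, and `extract_fixed` are hypothetical names.

```python
from typing import Any, Callable, Dict, Optional


class ResultCache:
    """Tiny dict-backed cache; a hypothetical stand-in for a real one."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}
        self.hits = 0

    def __len__(self) -> int:
        # Defining __len__ makes an empty cache evaluate as False.
        return len(self._items)

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        if key in self._items:
            self.hits += 1
            return self._items[key]
        value = compute()
        self._items[key] = value
        return value


def extract_buggy(cache: Optional[ResultCache], key: str, compute: Callable[[], Any]) -> Any:
    # BUG: an empty cache is falsy, so this branch is never taken
    # and the cache never receives its first entry.
    if cache:
        return cache.get_or_compute(key, compute)
    return compute()


def extract_fixed(cache: Optional[ResultCache], key: str, compute: Callable[[], Any]) -> Any:
    # Compare against None so an empty cache is still used.
    if cache is not None:
        return cache.get_or_compute(key, compute)
    return compute()
```

With the fixed variant, a second call for the same key is served from the cache, which is the kind of behaviour a "Cache working: True" check would be expected to confirm.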