jpwebb/pdftotext-mcp
PDFtotext MCP Server is a reliable and lightweight server for extracting text from PDF documents using the pdftotext utility from poppler-utils.
PDFtotext MCP Server
A reliable Model Context Protocol (MCP) server for PDF text extraction using the proven pdftotext utility from poppler-utils.
Why This Server?
Unlike other PDF MCP servers that suffer from logging interference, complex dependencies, and reliability issues, pdftotext-mcp is:
- Actually works - Clean JSON-RPC communication without stdout pollution
- Reliable - Built on mature pdftotext from poppler-utils (used by millions)
- Lightweight - Minimal dependencies, maximum compatibility
- Production tested - Successfully tested with Claude Desktop and other MCP clients
- Feature complete - Page-specific extraction, layout preservation, encoding options
- Error handling - Comprehensive validation and helpful error messages
Features
- Extract text from entire PDF documents or specific pages
- Preserve original layout formatting (optional)
- Multiple text encoding support (UTF-8, Latin1, ASCII)
- Comprehensive metadata in responses (word count, file info, etc.)
- File validation and security checks
- Fast processing with configurable timeouts
- Detailed error reporting with troubleshooting hints
Prerequisites
You must have pdftotext installed on your system:
Ubuntu/Debian
sudo apt update
sudo apt install poppler-utils
macOS
brew install poppler
Windows
# Using Chocolatey
choco install poppler
# Using Scoop
scoop install poppler
Verify Installation
pdftotext -v
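The same check can be scripted, for example before registering the server with a client. The following is a minimal sketch, not part of pdftotext-mcp: the helper name pdftotext_version is hypothetical, and reading the banner from stderr (with stdout as a fallback) is an assumption about how the installed pdftotext build reports its version.

```python
import shutil
import subprocess
from typing import Optional


def pdftotext_version(binary: str = "pdftotext") -> Optional[str]:
    """Return the banner printed by `<binary> -v`, or None if it cannot be run.

    Hypothetical helper for illustration; it is not shipped with pdftotext-mcp.
    """
    path = shutil.which(binary)
    if path is None:
        return None  # not on PATH: install poppler-utils first
    try:
        result = subprocess.run(
            [path, "-v"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Assumption: the banner is written to stderr; fall back to stdout.
    banner = (result.stderr or result.stdout).strip()
    return banner or None
```

A return value of None corresponds to the "pdftotext is not available" case described under Troubleshooting.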
Installation
Option 1: Global Installation (Recommended)
npm install -g pdftotext-mcp
Option 2: Use with npx (No Installation)
npx pdftotext-mcp
Option 3: Local Development
git clone https://github.com/jpwebb/pdftotext-mcp.git
cd pdftotext-mcp
npm install
npm start
Configuration
Add to your MCP client configuration:
Claude Desktop
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "pdftotext": {
      "command": "pdftotext-mcp"
    }
  }
}
Or with npx:
{
  "mcpServers": {
    "pdftotext": {
      "command": "npx",
      "args": ["pdftotext-mcp"]
    }
  }
}
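If the config file already lists other servers, the new entry has to be merged in rather than pasted over them. The sketch below shows one way to do that on an already-parsed config. The function name add_pdftotext_server and the "other" server entry are hypothetical. Locating, reading, and writing claude_desktop_config.json is left out because the file location depends on the platform.

```python
import json


def add_pdftotext_server(config: dict, use_npx: bool = False) -> dict:
    """Return a copy of `config` with a "pdftotext" entry under "mcpServers".

    Hypothetical helper; existing server entries are preserved.
    """
    if use_npx:
        entry = {"command": "npx", "args": ["pdftotext-mcp"]}
    else:
        entry = {"command": "pdftotext-mcp"}
    servers = dict(config.get("mcpServers", {}))  # copy, do not mutate input
    servers["pdftotext"] = entry
    return {**config, "mcpServers": servers}


# Example: a config that already has one (made-up) server registered.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
merged = add_pdftotext_server(existing)
merged_json = json.dumps(merged, indent=2)  # text to write back to the file
```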
Other MCP Clients
The server works with any MCP-compatible client. Use pdftotext-mcp as the command.
Usage
The server provides a single, powerful tool: read_pdf_text
Basic Usage
Extract entire document
{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf"
  }
}
Extract specific page
{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf",
    "page": 2
  }
}
Preserve layout formatting
{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf",
    "layout": true
  }
}
Custom encoding
{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf",
    "encoding": "Latin1"
  }
}
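MCP clients normally build these calls for you. For anyone scripting against the server directly, the examples above would typically travel as a JSON-RPC 2.0 request using the MCP tools/call method. The sketch below assembles such a request. The function name build_tool_call is hypothetical, and the envelope reflects a reading of the MCP specification rather than anything stated in this README.

```python
import json
from typing import Optional


def build_tool_call(request_id: int, path: str, page: Optional[int] = None,
                    layout: bool = False, encoding: Optional[str] = None) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for read_pdf_text.

    Hypothetical helper; optional arguments are omitted unless set, so the
    server applies its own defaults.
    """
    arguments = {"path": path}
    if page is not None:
        arguments["page"] = page
    if layout:
        arguments["layout"] = True
    if encoding is not None:
        arguments["encoding"] = encoding
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "read_pdf_text", "arguments": arguments},
    }


# Example: request page 2 of ./document.pdf, serialized as one line of JSON.
request = build_tool_call(1, "./document.pdf", page=2)
wire_message = json.dumps(request)
```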
Response Format
Success Response
{
  "success": true,
  "file": "document.pdf",
  "path": "/absolute/path/to/document.pdf",
  "extractedText": "Full text content...",
  "pageSpecific": "all",
  "layoutPreserved": false,
  "encoding": "UTF-8",
  "fileSize": 1048576,
  "lastModified": "2024-01-15T10:30:00.000Z",
  "extractedAt": "2024-01-15T10:35:00.000Z",
  "textLength": 5234,
  "wordCount": 892
}
Error Response
{
  "success": false,
  "error": "File not found: ./nonexistent.pdf",
  "errorType": "FILE_NOT_FOUND",
  "file": "./nonexistent.pdf",
  "timestamp": "2024-01-15T10:35:00.000Z"
}
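Because both shapes share the success flag, a caller can branch on it before touching any other field. The following sketch assumes the JSON payload has already been obtained as a string. The names PdfExtractionError and extracted_text are hypothetical, and only fields shown in the two examples above are read.

```python
import json


class PdfExtractionError(Exception):
    """Raised when the response payload reports success == false."""

    def __init__(self, error_type: str, message: str):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type


def extracted_text(payload: str) -> str:
    """Return extractedText from a success payload, or raise on an error payload."""
    data = json.loads(payload)
    if data.get("success"):
        return data["extractedText"]
    raise PdfExtractionError(
        data.get("errorType", "UNKNOWN_ERROR"),
        data.get("error", "no error message provided"),
    )
```

Keeping the error type on the exception lets calling code decide whether to retry, fix the path, or report the problem to the user.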
API Reference
Tool: read_pdf_text
Extracts text content from PDF files using pdftotext.
Parameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
path | string | Yes | - | Path to PDF file (relative or absolute) |
page | number | No | all pages | Specific page to extract (1-based) |
layout | boolean | No | false | Preserve original text layout |
encoding | string | No | "UTF-8" | Output text encoding |
Supported Encodings
- UTF-8 (default)
- Latin1
- ASCII
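To make the parameters concrete, here is a sketch of how they could map onto a pdftotext command line. This is an illustration under assumptions, not the server's actual implementation: the function name build_pdftotext_args is hypothetical, and passing the encoding name straight to -enc is a guess (the names a given pdftotext build accepts can be listed with pdftotext -listenc). The flags -f, -l, -layout, and -enc are standard pdftotext options, and "-" as the output file sends the text to stdout.

```python
from typing import List, Optional

SUPPORTED_ENCODINGS = ("UTF-8", "Latin1", "ASCII")


def build_pdftotext_args(path: str, page: Optional[int] = None,
                         layout: bool = False,
                         encoding: str = "UTF-8") -> List[str]:
    """Validate the tool parameters and return an argv list for pdftotext."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    if page is not None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(page, bool) or not isinstance(page, int) or page < 1:
            raise ValueError("page must be a positive integer (1-based)")
    # Assumption: the encoding name is forwarded unchanged to -enc.
    args = ["pdftotext", "-enc", encoding]
    if layout:
        args.append("-layout")
    if page is not None:
        # First and last page set to the same value selects a single page.
        args += ["-f", str(page), "-l", str(page)]
    args += [path, "-"]  # "-" writes the extracted text to stdout
    return args
```

Returning an argv list rather than a shell string avoids quoting problems with paths that contain spaces.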
Error Types
- FILE_NOT_FOUND - PDF file doesn't exist
- PERMISSION_DENIED - Cannot read the file
- INVALID_PDF - File is not a valid PDF
- PDFTOTEXT_ERROR - pdftotext utility error
- UNKNOWN_ERROR - Unexpected error
Troubleshooting
"pdftotext is not available"
Solution: Install poppler-utils (see Prerequisites)
"File not found"
Solutions:
- Use absolute paths: /home/user/document.pdf
- Check file exists: ls -la /path/to/file.pdf
- Verify MCP server working directory
"Permission denied"
Solutions:
- Check file permissions: chmod 644 document.pdf
- Ensure directory is readable: chmod 755 /path/to/directory/
"File is not a valid PDF"
Solutions:
- Verify file is actually a PDF: file document.pdf
- Check for file corruption
- Try with a different PDF file
MCP Connection Issues
Solutions:
- Restart your MCP client completely
- Check configuration syntax in config file
- Verify pdftotext-mcp is accessible in PATH
- Check MCP client logs for detailed errors
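The file-related checks in this section can also be run as a small preflight script before calling the tool. The sketch below is an assumption-laden illustration: the function name preflight is hypothetical, it reuses the documented error type names only as labels, and looking for "%PDF-" in the first 1024 bytes is a common heuristic rather than the validation the server itself performs.

```python
import os
from typing import Optional


def preflight(path: str) -> Optional[str]:
    """Return a documented error type name for an obvious problem, else None."""
    if not os.path.isfile(path):
        return "FILE_NOT_FOUND"
    if not os.access(path, os.R_OK):
        return "PERMISSION_DENIED"
    with open(path, "rb") as handle:
        header = handle.read(1024)
    # Heuristic: PDF files carry a "%PDF-" marker at or near the start.
    if b"%PDF-" not in header:
        return "INVALID_PDF"
    return None
```

Running it with the same absolute path that is passed to read_pdf_text also helps rule out working-directory mismatches.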
Testing
# Run tests
npm test
# Run tests with watch mode
npm run test:watch
# Run linter
npm run lint
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Setup
git clone https://github.com/jpwebb/pdftotext-mcp.git
cd pdftotext-mcp
npm install
Running Locally
npm start
Code Style
This project uses ESLint. Run npm run lint to check code style.
License
See the LICENSE file for details.
Acknowledgments
- Built for the Model Context Protocol ecosystem
- Uses the poppler-utils pdftotext utility
- Inspired by the need for reliable PDF processing in MCP environments
Made for the MCP community