wesleygriffin/pdfrag
PDF RAG MCP Server
A Model Context Protocol (MCP) server that provides powerful RAG (Retrieval-Augmented Generation) capabilities for PDF documents. This server uses ChromaDB for vector storage, sentence-transformers for embeddings, and semantic chunking for intelligent text segmentation.
Features
- ✅ Semantic Chunking: Intelligently groups sentences together instead of splitting at arbitrary character limits
- ✅ Vector Search: Find semantically similar content using embeddings
- ✅ Keyword Search: Traditional keyword-based search for exact terms
- ✅ OCR Support: Automatic detection and OCR processing for scanned/image-based PDFs
- ✅ Source Tracking: Maintains document names and page numbers for all chunks
- ✅ Add/Remove PDFs: Easily manage your document collection
- ✅ Persistent Storage: ChromaDB persists your embeddings to disk
- ✅ Multiple Output Formats: Get results in Markdown or JSON format
- ✅ Progress Reporting: Real-time feedback during long operations
Architecture
- Embedding Model: multi-qa-mpnet-base-dot-v1 (optimized for question-answering)
- Vector Database: ChromaDB with cosine similarity
- Chunking Strategy: Semantic chunking with configurable sentence grouping and overlap
- PDF Extraction: PyMuPDF for text extraction with OCR fallback for scanned PDFs
Installation
From Source
- Clone the repository:
git clone <repository-url>
cd pdfrag
- Install the package:
pip install -e .
- Verify installation:
pdfrag --help
pdfrag-cli --help
NLTK Data (Automatic)
The server automatically downloads required NLTK punkt tokenizer data on first run.
Tesseract (Optional - for OCR)
For scanned PDF support, install Tesseract:
- macOS: brew install tesseract
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
The server automatically detects scanned pages and uses OCR when Tesseract is available.
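As a rough illustration of how a "scanned page" check can work, the sketch below flags a page for OCR when the text extracted from it is nearly empty. The function name needs_ocr and the 50-character threshold are assumptions made for this example; the server's actual heuristic may differ.

```python
def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Return True when a page has too little extractable text.

    Both the function name and the 50-character threshold are
    hypothetical; the server's real detection logic may differ.
    """
    # Whitespace-only extraction results count as empty.
    return len(page_text.strip()) < min_chars


pages = ["", "   \n", "A normal page with plenty of selectable text. " * 3]
flags = [needs_ocr(text) for text in pages]
print(flags)
```

Pages flagged this way would be the ones handed to Tesseract; pages with enough selectable text can skip OCR entirely.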
Configuration
Database Location
The server stores its ChromaDB database in a configurable location. You can specify the database path using the --db-path command line argument:
# Use default location (~/.dotfiles/files/mcps/pdfrag/chroma_db)
pdfrag
# Use custom database location
pdfrag --db-path /path/to/your/database
Chunking Parameters
Default chunking settings:
- Chunk Size: 3 sentences per chunk
- Overlap: 1 sentence overlap between chunks
These can be customized when adding PDFs:
{
"pdf_path": "/path/to/document.pdf",
"chunk_size": 5, // Use 5 sentences per chunk
"overlap": 2 // 2 sentences overlap
}
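To make the effect of chunk_size and overlap concrete, here is a minimal sketch of sentence-window chunking. It assumes a naive regex sentence splitter instead of the NLTK punkt tokenizer the server uses, and the function name chunk_sentences is hypothetical, so treat it as an approximation of the idea and not the server's implementation.

```python
import re


def chunk_sentences(text: str, chunk_size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into overlapping chunks (illustrative only)."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the window already reached the last sentence
    return chunks


text = "One. Two. Three. Four. Five."
chunks = chunk_sentences(text, chunk_size=3, overlap=1)
print(chunks)
```

With the defaults, each chunk shares its last sentence with the start of the next one, which is what keeps context from being cut at chunk boundaries.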
Character Limit
Responses are limited to 25,000 characters by default. If exceeded, results are automatically truncated with a warning message.
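The truncation behavior can be pictured with the short sketch below. The 25,000 figure comes from the text above; the function name truncate_response and the wording of the warning are assumptions, and whether the real limit counts the warning itself is not specified here.

```python
CHARACTER_LIMIT = 25_000  # default limit stated above


def truncate_response(text: str, limit: int = CHARACTER_LIMIT) -> str:
    """Cut an over-long response and append a warning (hypothetical wording)."""
    if len(text) <= limit:
        return text
    warning = f"\n\n[Truncated: response exceeded {limit} characters]"
    return text[:limit] + warning


short = truncate_response("hello")
long = truncate_response("x" * 30_000)
print(len(short), len(long))
```

If truncation shows up often, lowering top_k or filtering to one document is usually a simpler fix than raising the limit.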
Project Structure
pdfrag/
├── src/pdfrag/ # Main package
│ ├── server.py # FastMCP server with 5 tools
│ ├── database.py # ChromaDB interface
│ ├── embeddings.py # Embedding generation
│ ├── pdf.py # PDF text extraction
│ ├── chunking.py # Semantic chunking
│ └── cli.py # MCP CLI tool
├── tests/ # Test suite
├── docs/ # Documentation
├── examples/ # Configuration examples
└── pyproject.toml # Package configuration
MCP Tools
1. pdf_add
Add a PDF document to the RAG database.
Input:
{
"pdf_path": "/absolute/path/to/document.pdf",
"chunk_size": 3, // optional, default: 3
"overlap": 1 // optional, default: 1
}
Output:
{
"status": "success",
"message": "Successfully added 'document.pdf' to the database",
"document_id": "a1b2c3d4...",
"filename": "document.pdf",
"pages": 15,
"chunks": 127,
"chunk_size": 3,
"overlap": 1
}
Example Use Cases:
- Adding research papers for reference
- Indexing documentation
- Building a searchable knowledge base
2. pdf_remove
Remove a PDF document from the database.
Input:
{
"document_id": "a1b2c3d4..." // Get from pdf_list
}
Output:
{
"status": "success",
"message": "Successfully removed 'document.pdf' from the database",
"document_id": "a1b2c3d4...",
"removed_chunks": 127
}
3. pdf_list
List all PDF documents in the database.
Input:
{
"response_format": "markdown" // or "json"
}
Output (Markdown):
# PDF Documents (2 total)
## research_paper.pdf
**Document ID:** a1b2c3d4...
**Chunks:** 127
**Added:** N/A
## documentation.pdf
**Document ID:** e5f6g7h8...
**Chunks:** 89
**Added:** N/A
Output (JSON):
{
"count": 2,
"documents": [
{
"document_id": "a1b2c3d4...",
"filename": "research_paper.pdf",
"chunk_count": 127
},
{
"document_id": "e5f6g7h8...",
"filename": "documentation.pdf",
"chunk_count": 89
}
]
}
4. pdf_search_similarity
Search using semantic similarity (vector search).
Input:
{
"query": "machine learning techniques for text classification",
"top_k": 5, // optional, default: 5
"document_filter": null, // optional, search specific doc
"response_format": "markdown" // optional, default: markdown
}
Output (Markdown):
# Search Results for: 'machine learning techniques for text classification'
Found 5 relevant chunks:
## Result 1
**Document:** research_paper.pdf
**Page:** 7
**Similarity Score:** 0.8754
**Content:**
Machine learning approaches to text classification have evolved significantly...
---
Use Cases:
- Finding relevant information without exact keywords
- Discovering related concepts
- Question answering over documents
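For intuition about the similarity score, the sketch below ranks a few toy vectors by cosine similarity to a query vector. It is a pure-Python illustration with made-up 3-dimensional vectors and hypothetical helper names; in the server the vectors are 768-dimensional embeddings and the search is delegated to ChromaDB, and how its distance is converted into the reported score is not covered here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk names with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in chunks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings.
toy = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.6, 0.8, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], toy, k=2)
print(results)
```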
5. pdf_search_keywords
Search using keyword matching.
Input:
{
"keywords": "neural network backpropagation",
"top_k": 5, // optional, default: 5
"document_filter": null, // optional
"response_format": "markdown" // optional, default: markdown
}
Output:
Similar to pdf_search_similarity, but ranked by keyword occurrence count.
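Ranking by keyword occurrence count can be sketched as follows. This assumes the keywords string is split on whitespace and matched case-insensitively as substrings; the README does not say how the server tokenizes or matches, and the helper names are hypothetical.

```python
def keyword_score(text: str, keywords: str) -> int:
    """Count case-insensitive occurrences of each whitespace-separated keyword."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in keywords.lower().split())


def rank_by_keywords(chunks: list[str], keywords: str, top_k: int = 5) -> list[tuple[int, str]]:
    """Return (score, chunk) pairs with at least one hit, highest score first."""
    scored = [(keyword_score(chunk, keywords), chunk) for chunk in chunks]
    hits = [pair for pair in scored if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "Backpropagation trains a neural network.",
    "A neural network stacks layers.",
    "Tesseract performs OCR.",
]
ranked = rank_by_keywords(chunks, "neural network backpropagation")
print(ranked)
```

Because every chunk has to be scanned, a scheme like this scales with the size of the collection, unlike the indexed vector search.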
Use Cases:
- Finding specific technical terms
- Locating exact phrases or terminology
- Verifying presence of keywords in documents
Usage with Claude Desktop
1. Add to Claude Desktop Configuration
Edit your Claude Desktop config file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Add the server:
{
"mcpServers": {
"pdf-rag": {
"command": "pdfrag",
"args": ["--db-path", "/path/to/your/chroma_db"],
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}
See examples/claude_desktop_config.json for a complete example.
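If the config file already lists other MCP servers, the new entry should be merged in, not pasted over them. The sketch below shows one way to do that on an in-memory dict; the function name add_pdfrag_server and the "other-server" entry are hypothetical, and reading or writing the actual file at the paths listed above is left out.

```python
import json


def add_pdfrag_server(config: dict, db_path: str) -> dict:
    """Return a copy of a Claude Desktop config with the pdf-rag entry added."""
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    servers = updated.setdefault("mcpServers", {})
    servers["pdf-rag"] = {
        "command": "pdfrag",
        "args": ["--db-path", db_path],
        "env": {"PYTHONUNBUFFERED": "1"},
    }
    return updated


# A made-up existing config with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "other"}}}
merged = add_pdfrag_server(existing, "/path/to/your/chroma_db")
print(json.dumps(merged, indent=2))
```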
2. Restart Claude Desktop
After adding the configuration, restart Claude Desktop to load the MCP server.
3. Test the Connection
In Claude Desktop, try:
Can you list the PDFs in the RAG database?
Claude will use the pdf_list tool to show available documents.
Example Workflows
Building a Research Database
1. Add documents:
"Add these PDFs to the database: /research/paper1.pdf, /research/paper2.pdf"
2. Search for concepts:
"Search for information about 'gradient descent optimization' in the database"
3. Find specific terms:
"Search for the keyword 'convolutional neural network' and show me the pages"
Document Q&A
1. Add documentation:
"Add this user manual: /docs/product_manual.pdf"
2. Ask questions:
"How do I configure the network settings according to the manual?"
3. Find references:
"Which page discusses troubleshooting connection errors?"
Knowledge Base Management
1. List documents:
"Show me all documents in the RAG database"
2. Remove outdated docs:
"Remove the document with ID a1b2c3d4..."
3. Search across all:
"Search all documents for information about API authentication"
Advanced Configuration
Custom Chunk Sizes
For different document types:
Technical Documents (code, APIs):
- Smaller chunks (2-3 sentences)
- Minimal overlap (0-1 sentences)
- Preserves code structure
Narrative Documents (articles, books):
- Larger chunks (5-7 sentences)
- More overlap (2-3 sentences)
- Maintains context flow
Scientific Papers:
- Medium chunks (3-5 sentences)
- Moderate overlap (1-2 sentences)
- Balances detail and context
Document Filtering
Search within specific documents:
{
"query": "data preprocessing",
"document_filter": "a1b2c3d4..." // Only search this doc
}
Output Format Selection
Choose format based on use case:
- Markdown: Best for human reading and Claude's analysis
- JSON: Best for programmatic processing and data extraction
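As an example of programmatic processing, the sketch below parses a pdf_list response in JSON format. The payload is copied from the sample output in the pdf_list section; the variable names are illustrative only.

```python
import json

# Sample payload copied from the pdf_list JSON output in this README.
raw = """
{
  "count": 2,
  "documents": [
    {"document_id": "a1b2c3d4...", "filename": "research_paper.pdf", "chunk_count": 127},
    {"document_id": "e5f6g7h8...", "filename": "documentation.pdf", "chunk_count": 89}
  ]
}
"""

listing = json.loads(raw)
# Total number of chunks across all indexed documents.
total_chunks = sum(doc["chunk_count"] for doc in listing["documents"])
# Map filenames to IDs, e.g. to build a document_filter or pdf_remove call.
ids_by_name = {doc["filename"]: doc["document_id"] for doc in listing["documents"]}
print(total_chunks, ids_by_name)
```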
Troubleshooting
"File not found" Error
Ensure you're using absolute paths:
"/home/user/documents/paper.pdf" ✅
"~/documents/paper.pdf" ❌ (needs expansion)
"./paper.pdf" ❌ (relative path)
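When a path comes from user input, it can be normalized before it is passed to pdf_add. The sketch below uses the standard library's pathlib to expand ~ and resolve relative paths; the helper name to_absolute is hypothetical, and this is client-side preparation, not something the server is stated to do.

```python
from pathlib import Path


def to_absolute(path_str: str) -> str:
    """Expand '~' and resolve a relative path against the current directory."""
    return str(Path(path_str).expanduser().resolve())


absolute = to_absolute("~/documents/paper.pdf")
relative = to_absolute("./paper.pdf")
print(absolute)
print(relative)
```

Note that resolve() does not check that the file exists, so a "File not found" error is still possible if the resolved path is wrong.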
Empty PDF Results / Scanned PDFs
The server automatically detects and processes scanned PDFs using OCR. If you get an error about no text being extracted:
- Install Tesseract (if not already installed):
  - macOS: brew install tesseract
  - Ubuntu/Debian: sudo apt-get install tesseract-ocr
  - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
- Retry adding the PDF. The server will automatically use OCR for pages with minimal text.
The error message will indicate if OCR is needed: "ensure tesseract is installed for scanned PDFs"
Out of Memory
If processing large PDFs causes memory issues:
- Reduce chunk_size to create more, smaller chunks
- Process documents one at a time
- Increase system swap space
ChromaDB Errors
If ChromaDB complains about existing collections:
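# Remove the database directory (use the path you pass to --db-path).
# This deletes all stored embeddings, so PDFs must be re-added afterwards.
rm -rf ./chroma_db
# Restart the server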
Performance Considerations
Embedding Generation
The first time you add a document, the model will be downloaded (~400MB). Subsequent operations are faster.
Typical Times:
- 10-page PDF: ~5-10 seconds
- 100-page PDF: ~30-60 seconds
- 1000-page PDF: ~5-10 minutes
Search Performance
- Similarity Search: Fast (< 1 second for most queries)
- Keyword Search: Slower for large collections (scales with document count)
Storage
- Embeddings: ~3KB per chunk (768-dimensional vectors, assuming 32-bit floats)
- Text Storage: Depends on chunk size
- Example: 1000 chunks ≈ 3MB of raw vectors in ChromaDB
Best Practices
1. Organize Documents
Use descriptive filenames:
research_ml_2024.pdf ✅
document (1).pdf ❌
2. Test Chunk Sizes
Different documents benefit from different chunking:
# Try multiple chunk sizes for the same document
pdf_add(pdf_path="/path/to/doc.pdf", chunk_size=3, overlap=1) # Test 1
pdf_remove(document_id="...") # Remove before re-adding
pdf_add(pdf_path="/path/to/doc.pdf", chunk_size=5, overlap=2) # Test 2
3. Use Document Filters
When searching specific documents:
# More focused, faster results
pdf_search_similarity(
query="...",
document_filter="specific_doc_id"
)
4. Combine Search Types
Use both search methods for comprehensive results:
- Semantic search for concepts
- Keyword search for exact terms
Security Notes
- File Access: Server can read any PDF the Python process can access
- Storage: Embeddings and text stored unencrypted in ChromaDB
- No Authentication: MCP servers trust the client (Claude Desktop)
For production use:
- Restrict file system permissions
- Use dedicated database directories
- Consider encryption for sensitive documents
Contributing
To extend this server:
- Add New Tools: Follow the @mcp.tool() decorator pattern
- Custom Chunking: Implement in semantic_chunking() function
- Additional Embeddings: Swap models in initialization
- Metadata: Extend metadatas dict in pdf_add()
License
MIT License - See LICENSE file for details
Acknowledgments
- Anthropic: MCP Protocol and SDK
- ChromaDB: Vector database
- Sentence Transformers: Embedding models
- PyMuPDF: PDF text extraction and OCR support
Support
For issues or questions:
- Check the troubleshooting section
- Review MCP documentation: https://modelcontextprotocol.io
- Check ChromaDB docs: https://docs.trychroma.com
Built with ❤️ using the Model Context Protocol