Baronco/Local-Docs-MCP-Tool
A Model Context Protocol (MCP) server designed for interacting with local documents on Windows systems, featuring document discovery, processing, and OCR support.
📚 Local Documents MCP Server
A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.
✨ Features
- 📁 Document Discovery: List all documents in a specified directory
- ⚡ Document Processing: Convert various document formats to markdown
- 🔍 OCR Support: Extract text from scanned PDFs using Tesseract OCR
- 🎯 Token Management: Automatic content truncation based on token limits
- 📄 Multi-format Support: Handle Word docs, PDFs, PowerPoint, Excel, and more
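The token-management feature can be pictured with a short sketch. This is not the project's actual code: `truncate_to_limit` and `whitespace_tokens` are hypothetical names, and the whitespace tokenizer is a stand-in for a real tokenizer so the example runs with the standard library only.

```python
from typing import Callable, List, Tuple


def whitespace_tokens(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def truncate_to_limit(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokens,
) -> Tuple[str, int, bool]:
    """Return (content, token_count, was_truncated) for a token budget."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        # Under budget: return the text unchanged.
        return text, len(tokens), False
    kept = tokens[:max_tokens]
    # Over budget: keep only the first max_tokens tokens.
    return " ".join(kept), len(kept), True


content, used, truncated = truncate_to_limit("one two three four five", 3)
print(content, used, truncated)
```

Passing the tokenizer as a parameter keeps the truncation logic independent of how tokens are counted, so a model-specific tokenizer could be swapped in.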
🛠️ Tools Available
- list_documents: Find documents by path, name, and extension
- load_documents: Extract document content as markdown
- load_scanned_document: Extract text from scanned PDFs using OCR
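As a rough illustration of document discovery by path, name, and extension, the sketch below walks a directory with `pathlib`. The function name `find_documents`, its parameters, and the recursive search are assumptions made for this example; the real `list_documents` tool may behave differently.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def find_documents(
    root: str,
    extensions: Optional[Iterable[str]] = None,
    name_contains: Optional[str] = None,
) -> List[Path]:
    """Recursively collect files under root, filtered by extension and name."""
    wanted = None
    if extensions:
        # Normalize "pdf" and ".PDF" to ".pdf" so matching is case-insensitive.
        wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    matches = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if wanted is not None and path.suffix.lower() not in wanted:
            continue
        if name_contains and name_contains.lower() not in path.name.lower():
            continue
        matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    for doc in find_documents(".", extensions=["pdf", "docx"]):
        print(doc)
```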
💻 System Requirements
- Operating System: Windows 10/11
- Python: 3.13 or higher
- Package Manager: uv (recommended)
📋 Prerequisites Installation
1. 🐍 Python 3.13
Download and install Python 3.13 from python.org
2. ⚡ UV Package Manager
Install uv using pip:
pip install uv
3. 📖 Poppler for Windows
Purpose: Required for PDF processing and conversion to images for OCR.
- Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/
- Extract the ZIP file to: D:\Program Files\poppler-24.08.0
- The Poppler binaries should be located at: D:\Program Files\poppler-24.08.0\Library\bin
Alternative locations: You can install Poppler in any directory; just update POPPLER_PATH in the .env file to point at its Library\bin folder.
4. 👁️ Tesseract OCR
Purpose: Required for extracting text from scanned documents and images.
- Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki
- Install Tesseract following the installer instructions
- Make sure Tesseract is added to your system PATH, or note the installation directory
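To confirm that Tesseract is reachable, a small check like the following may help. It is a sketch under stated assumptions: `locate_tesseract` is a hypothetical helper, and the fallback directory is whatever you noted during installation.

```python
import shutil
from pathlib import Path
from typing import Optional


def locate_tesseract(install_dir: Optional[str] = None) -> Optional[str]:
    """Return the path to the tesseract executable, or None if not found."""
    # First choice: an executable already visible on the system PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Fallback: the installation directory noted during setup (assumption).
    if install_dir:
        candidate = Path(install_dir) / "tesseract.exe"
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    path = locate_tesseract()
    print(path if path else "Tesseract not found on PATH")
```

If Tesseract is installed but not on PATH, pytesseract lets you point at the executable directly by setting `pytesseract.pytesseract.tesseract_cmd` to the returned path.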
🚀 Project Installation
1. 📥 Clone or Download the Project
git clone <your-repo-url>
cd LocalDocs
2. 📦 Install Python Dependencies
uv sync
This will install all required dependencies from pyproject.toml:
- markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2 - Document conversion
- mcp[cli]>=1.10.1 - MCP server framework
- opencv-python>=4.11.0.86 - Image processing
- pdf2image>=1.17.0 - PDF to image conversion
- pytesseract>=0.3.13 - Tesseract OCR wrapper
- python-dotenv>=1.1.1 - Environment variable management
- tiktoken>=0.9.0 - Token counting
3. ⚙️ Configure Environment Variables
Create or update the .env file in the project root:
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
Note: Update the path to match your Poppler installation location.
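A quick way to catch a wrong POPPLER_PATH early is to check that the directory actually contains Poppler's `pdftoppm` binary. The helper below is a hypothetical sketch, not part of the project; it reads the value from any mapping (by default the process environment) so it can be tried without a .env loader.

```python
import os
from pathlib import Path
from typing import Mapping


def poppler_bin_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Validate POPPLER_PATH and return it as a Path."""
    # Strip whitespace and any surrounding quotes left over from the .env file.
    raw = env.get("POPPLER_PATH", "").strip().strip('"')
    if not raw:
        raise RuntimeError("POPPLER_PATH is not set")
    bin_dir = Path(raw)
    # pdftoppm is one of the binaries shipped in Poppler's bin directory.
    if not any((bin_dir / name).is_file() for name in ("pdftoppm.exe", "pdftoppm")):
        raise FileNotFoundError(f"pdftoppm not found in {bin_dir}")
    return bin_dir


if __name__ == "__main__":
    try:
        print("Poppler found in", poppler_bin_dir())
    except (RuntimeError, FileNotFoundError) as exc:
        print("Poppler check failed:", exc)
```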
🔧 Configuration for MCP Clients
🤖 Claude Desktop Configuration
Add the following configuration to your Claude Desktop configuration file (claude_desktop_config.json). After server.py, the server takes two arguments:
- First argument: Path to your documents directory
  - Example: "C:\\Users\\YourUsername\\Documents\\MyDocuments"
  - Use double backslashes for Windows paths in JSON
- Second argument: Maximum tokens per document
  - Example: "30000"
  - Adjust based on your needs and Claude's token limits
📝 Example Configurations
For different document locations:
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
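If hand-escaping backslashes is error-prone, the same configuration can be generated with Python's `json` module, which doubles the backslashes automatically. This is a convenience sketch; `build_claude_config` is a hypothetical helper and the paths are the placeholder values from the example above.

```python
import json


def build_claude_config(project_dir: str, documents_dir: str, max_tokens: int) -> dict:
    """Build the mcpServers entry shown in the example configuration."""
    return {
        "mcpServers": {
            "local-documents": {
                "command": "uv",
                "args": [
                    "--directory",
                    project_dir,
                    "run",
                    "server.py",
                    documents_dir,
                    str(max_tokens),  # the token limit is passed as a string
                ],
            }
        }
    }


# Raw strings let you write Windows paths with single backslashes.
config = build_claude_config(
    r"C:\Users\YourUsername\Documents\LocalDocs",
    r"C:\Users\YourUsername\Documents\MyDocuments",
    30000,
)
text = json.dumps(config, indent=2)
print(text)
```

The printed output can be pasted into the Claude Desktop configuration file, merging it with any servers already listed under `mcpServers`.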
🎯 Usage
🚀 Starting the Server
The server is automatically started when Claude Desktop loads with the configured settings.
🔄 Available Operations
- 📋 List Documents: Discover all documents in your configured directory
- 📄 Load Standard Documents: Process Word docs, PDFs, PowerPoint, Excel files
- 🔍 Load Scanned Documents: Use OCR to extract text from scanned PDFs
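The scanned-document path can be sketched as "render each PDF page to an image, then run OCR on each image". The function below is an illustration under assumptions, not the project's `ocr.py`: `ocr_scanned_pdf` and its `render`/`recognize` parameters are hypothetical. When they are not supplied, it falls back to `pdf2image.convert_from_path` and `pytesseract.image_to_string`, which must be installed.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


def ocr_scanned_pdf(
    pdf_path: str,
    poppler_path: Optional[str] = None,
    render: Optional[Callable[[str], List[Any]]] = None,
    recognize: Optional[Callable[[Any], str]] = None,
) -> Tuple[str, float]:
    """Return (extracted_text, elapsed_seconds) for a scanned PDF."""
    if render is None:
        # Third-party import kept local so the sketch loads without it.
        from pdf2image import convert_from_path

        def render(path: str) -> List[Any]:
            return convert_from_path(path, poppler_path=poppler_path)

    if recognize is None:
        import pytesseract

        recognize = pytesseract.image_to_string

    started = time.perf_counter()
    pages = render(pdf_path)
    # OCR each page image and separate pages with a blank line.
    text = "\n\n".join(recognize(page).strip() for page in pages)
    return text, time.perf_counter() - started


# Dry run with fake components, so no PDF or OCR engine is needed:
demo_text, demo_seconds = ocr_scanned_pdf(
    "scan.pdf",
    render=lambda path: ["page-1", "page-2"],
    recognize=lambda image: f"text from {image}",
)
print(demo_text)
```

Timing the whole operation mirrors the processing-time figure the server reports for OCR requests.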
📊 Response Format
The server returns structured responses with:
- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format
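One way to picture such a response is as a small record. The field names below are illustrative assumptions only; the server's actual keys are not documented here and may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DocumentResponse:
    """Hypothetical shape of a tool response (field names are assumptions)."""

    path: str
    token_count: int
    truncated: bool
    content_markdown: str
    processing_seconds: Optional[float] = None  # only set for OCR operations


response = DocumentResponse(
    path=r"C:\Users\YourUsername\Documents\MyDocuments\report.pdf",
    token_count=1200,
    truncated=False,
    content_markdown="# Report\n...",
)
print(asdict(response))
```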
🛠️ Troubleshooting
⚠️ Common Issues
- 🔍 Poppler not found
  - Verify Poppler installation path
  - Check .env file configuration
  - Ensure path uses double backslashes in Windows
- 👁️ Tesseract not found
  - Verify Tesseract installation
  - Add Tesseract to system PATH
  - Restart command prompt/PowerShell
- 🔐 Permission denied errors
  - Ensure the document directory is accessible
  - Check file permissions
  - Run as administrator if necessary
- ❌ Import errors
  - Verify all dependencies are installed: uv sync
  - Check Python version: python --version
  - Ensure you're using Python 3.13
- ⏳ Large document processing
  - Reduce token limit for better performance
  - Consider splitting large documents
  - Monitor memory usage during OCR operations
🐛 Debug Information
To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.
📁 File Structure
LocalDocs/
├── server.py            # Main MCP server
├── pyproject.toml       # Project dependencies
├── .env                 # Environment configuration
├── README.md            # This documentation
├── src/
│   └── instructions.md  # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py    # Document conversion
    ├── max_tokens.py    # Token management
    ├── ocr.py           # OCR processing
    ├── path_files.py    # File discovery
    └── prompts.py       # Instruction loading
📄 Supported Document Formats
- 📊 Microsoft Office: .docx, .xlsx, .pptx
- 📖 PDF: Regular PDFs and scanned PDFs (via OCR)
⚡ Performance Considerations
- 🔍 OCR Processing: Scanned documents take significantly longer to process
- 🎯 Token Limits: Adjust based on your document sizes and Claude's context window
- 💾 Memory Usage: Large documents and OCR operations can be memory-intensive
🤝 Contributing
When contributing to this project:
- Ensure compatibility with Windows and Python 3.13
- Test with various document formats
- Verify OCR functionality with scanned documents
- Update documentation for any new features
📚 Related Documentation
- MCP Documentation
- Claude Desktop MCP Guide
- PDF2Image
- Poppler PDF Processing
- Tesseract OCR
- MarkItDown
🗺️ Roadmap and Future Enhancements
🔮 Planned Features
- 🧠 Vector Storage and RAG Integration: Future versions will include vector-based document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support persistent document indexing
- 🔍 Enhanced OCR Validation: Currently, OCR functionality for scanned books has not been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor quality scans
  - Non-standard fonts or languages
💡 Current Recommendations
🚀 For Large Context Models
- 🤖 Gemini Models: With 1M+ token context windows, you can process very long documents without truncation
- 🎯 Token Management: Current implementation supports up to 128K tokens by default, but can be adjusted for larger context models
- 📖 Document Processing: Consider using higher token limits (e.g., 500K-1M) when working with:
- Complete books or long reports
- Multiple related documents
- Comprehensive document analysis
⚠️ Limitations to Consider
- 🔍 OCR Reliability: Scanned document processing is experimental and may require manual validation
- ⏳ Processing Time: Large documents and OCR operations can be time-intensive
- 💾 Memory Usage: High-resolution scanned documents may require significant system resources