Baronco/Local-Docs-MCP-Tool
Local Documents MCP Server
A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.
Features
- Document Discovery: List all documents in a specified directory
- Document Processing: Convert various document formats to markdown
- OCR Support: Extract text from scanned PDFs using Tesseract OCR
- Token Management: Automatic content truncation based on token limits
- Multi-format Support: Handle Word docs, PDFs, PowerPoint, Excel, and more
Tools Available
- list_documents: Find documents by path, name, and extension
- load_documents: Extract document content as markdown
- load_scanned_document: Extract text from scanned PDFs using OCR
System Requirements
- Operating System: Windows 10/11
- Python: 3.13 or higher
- Package Manager: uv (recommended)
Prerequisites Installation
1. Python 3.13
Download and install Python 3.13 from python.org
2. uv Package Manager
Install uv using pip:
pip install uv
3. Poppler for Windows
Purpose: Required for PDF processing and conversion to images for OCR.
- Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/
- Extract the ZIP file to: D:\Program Files\poppler-24.08.0
- The Poppler binaries should be located at: D:\Program Files\poppler-24.08.0\Library\bin
Alternative locations: You can install Poppler in any directory; just make sure to update the .env file with the correct path.
4. Tesseract OCR
Purpose: Required for extracting text from scanned documents and images.
- Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki
- Install Tesseract following the installer instructions
- Make sure Tesseract is added to your system PATH, or note the installation directory
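To confirm the last step, a quick check along these lines may help. It is a minimal sketch, not part of the project: `find_executable` is a hypothetical helper, and the folder in the example call is only an illustrative install location.

```python
import shutil
from typing import Optional


def find_executable(name: str, install_dir: Optional[str] = None) -> Optional[str]:
    """Return the full path to an executable, or None if it cannot be found."""
    # Look on the system PATH first, the same way a shell would.
    found = shutil.which(name)
    if found:
        return found
    # Fall back to the installation directory noted during setup, if given.
    if install_dir:
        return shutil.which(name, path=install_dir)
    return None


if __name__ == "__main__":
    # The directory below is an example location, not a confirmed default.
    location = find_executable("tesseract", r"C:\Program Files\Tesseract-OCR")
    print(location or "tesseract not found; add it to PATH or note its folder")
```

If Tesseract is installed but not on PATH, pytesseract can be pointed at the executable by setting `pytesseract.pytesseract.tesseract_cmd`; whether this project's server does so is not stated here.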
Project Installation
1. Clone or Download the Project
git clone <your-repo-url>
cd LocalDocs
2. Install Python Dependencies
uv sync
This will install all required dependencies from pyproject.toml:
- markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2: Document conversion
- mcp[cli]>=1.10.1: MCP server framework
- opencv-python>=4.11.0.86: Image processing
- pdf2image>=1.17.0: PDF to image conversion
- pytesseract>=0.3.13: Tesseract OCR wrapper
- python-dotenv>=1.1.1: Environment variable management
- tiktoken>=0.9.0: Token counting
3. Configure Environment Variables
Create or update the .env file in the project root:
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
Note: Update the path to match your Poppler installation location.
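A small check like the following can catch a wrong POPPLER_PATH before any PDF is processed. It is a sketch under stated assumptions: `resolve_poppler_bin` is a hypothetical helper, and it assumes the variable is already in the environment (for example after python-dotenv's `load_dotenv()` has read the .env file). It looks for `pdftoppm`, one of the Poppler tools pdf2image calls.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_poppler_bin(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the Poppler bin directory named by POPPLER_PATH, or raise."""
    # Default to the process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    raw = env.get("POPPLER_PATH")
    if not raw:
        raise RuntimeError("POPPLER_PATH is not set; add it to the .env file")
    bin_dir = Path(raw)
    # pdftoppm ships as pdftoppm.exe on Windows and pdftoppm elsewhere.
    candidates = [bin_dir / "pdftoppm.exe", bin_dir / "pdftoppm"]
    if not any(candidate.is_file() for candidate in candidates):
        raise FileNotFoundError(f"pdftoppm not found in {bin_dir}")
    return bin_dir
```

The returned directory is the kind of value pdf2image's `convert_from_path` accepts through its `poppler_path` parameter.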
Configuration for MCP Clients
Claude Desktop Configuration
Add a local-documents entry to the mcpServers section of your Claude Desktop config.json file (see Example Configurations). After the uv arguments, server.py takes two positional arguments:
- First argument: Path to your documents directory
  - Example: "C:\\Users\\YourUsername\\Documents\\MyDocuments"
  - Use double backslashes for Windows paths in JSON
- Second argument: Maximum tokens per document
  - Example: "30000"
  - Adjust based on your needs and Claude's token limits
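The two arguments above arrive at the server as plain strings. How server.py actually reads them is not shown in this README; the following is one possible sketch, with `parse_server_args` as a hypothetical name.

```python
from pathlib import Path
from typing import Sequence, Tuple


def parse_server_args(argv: Sequence[str]) -> Tuple[Path, int]:
    """Validate the two positional arguments: documents directory and token limit."""
    if len(argv) != 2:
        raise ValueError("expected: <documents_dir> <max_tokens>")
    docs_dir = Path(argv[0])
    if not docs_dir.is_dir():
        raise ValueError(f"documents directory not found: {docs_dir}")
    # Every argument is passed as a string, so "30000" must be converted.
    max_tokens = int(argv[1])
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return docs_dir, max_tokens


# Typical call site (assumption): parse_server_args(sys.argv[1:])
```

Failing early with a clear message makes a mistyped path easier to spot in the Claude Desktop logs.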
Example Configurations
For different document locations:
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
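Hand-escaping Windows paths in JSON is easy to get wrong. As an alternative, the entry can be generated with Python's `json` module, which doubles the backslashes automatically. This is a sketch only; `build_claude_config` is a hypothetical helper and the paths are the placeholders from the example above.

```python
import json


def build_claude_config(project_dir: str, docs_dir: str, max_tokens: int) -> dict:
    """Build the mcpServers entry shown in the example configuration."""
    return {
        "mcpServers": {
            "local-documents": {
                "command": "uv",
                "args": [
                    "--directory", project_dir,
                    "run", "server.py",
                    docs_dir,
                    str(max_tokens),  # arguments are passed as strings
                ],
            }
        }
    }


# Raw strings keep the single backslashes of a normal Windows path.
config = build_claude_config(
    r"C:\Users\YourUsername\Documents\LocalDocs",
    r"C:\Users\YourUsername\Documents\MyDocuments",
    30000,
)
# json.dumps writes each backslash as \\ so the result is valid JSON.
text = json.dumps(config, indent=2)
print(text)
```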
Usage
Starting the Server
Claude Desktop starts the server automatically when it loads the configured settings.
Available Operations
- List Documents: Discover all documents in your configured directory
- Load Standard Documents: Process Word docs, PDFs, PowerPoint, Excel files
- Load Scanned Documents: Use OCR to extract text from scanned PDFs
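To illustrate the discovery step, here is a minimal sketch of listing documents by path, name, and extension. It is not the project's implementation: `find_documents`, the recursive search, and the extension set (taken from the supported formats listed in this README) are assumptions.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Extensions named under Supported Document Formats (assumed to be the filter).
SUPPORTED_EXTENSIONS = frozenset({".docx", ".xlsx", ".pptx", ".pdf"})


def find_documents(
    root: str, extensions: Iterable[str] = SUPPORTED_EXTENSIONS
) -> List[Dict[str, str]]:
    """Return path, name, and extension for each supported file under root."""
    wanted = {ext.lower() for ext in extensions}
    results = []
    # rglob walks subdirectories too; sorting keeps the output stable.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            results.append(
                {
                    "path": str(path),
                    "name": path.stem,
                    "extension": path.suffix.lower(),
                }
            )
    return results
```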
Response Format
The server returns structured responses with:
- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format
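The exact response schema is not given in this README. As a rough illustration of the fields listed above, a response could be modeled like this; every field name here is hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DocumentResult:
    """Illustrative response shape; field names are assumptions."""

    path: str
    token_count: int
    content_markdown: str
    processing_seconds: Optional[float] = None  # only set for OCR operations


def to_payload(result: DocumentResult) -> str:
    """Serialize a result to JSON, omitting the timing when it is absent."""
    payload = asdict(result)
    if payload["processing_seconds"] is None:
        del payload["processing_seconds"]
    return json.dumps(payload, indent=2)
```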
Troubleshooting
Common Issues
- Poppler not found
  - Verify Poppler installation path
  - Check .env file configuration
  - Ensure path uses double backslashes in Windows
- Tesseract not found
  - Verify Tesseract installation
  - Add Tesseract to system PATH
  - Restart command prompt/PowerShell
- Permission denied errors
  - Ensure the document directory is accessible
  - Check file permissions
  - Run as administrator if necessary
- Import errors
  - Verify all dependencies are installed: uv sync
  - Check Python version: python --version
  - Ensure you're using Python 3.13
- Large document processing
  - Reduce token limit for better performance
  - Consider splitting large documents
  - Monitor memory usage during OCR operations
Debug Information
To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.
File Structure
LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading
Supported Document Formats
- Microsoft Office: .docx, .xlsx, .pptx
- PDF: Regular PDFs and scanned PDFs (via OCR)
Performance Considerations
- OCR Processing: Scanned documents take significantly longer to process
- Token Limits: Adjust based on your document sizes and Claude's context window
- Memory Usage: Large documents and OCR operations can be memory-intensive
Contributing
When contributing to this project:
- Ensure compatibility with Windows and Python 3.13
- Test with various document formats
- Verify OCR functionality with scanned documents
- Update documentation for any new features
Related Documentation
- MCP Documentation
- Claude Desktop MCP Guide
- PDF2Image
- Poppler PDF Processing
- Tesseract OCR
- MarkItDown
Roadmap and Future Enhancements
Planned Features
- Vector Storage and RAG Integration: Future versions will include vector-based document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support persistent document indexing
- Enhanced OCR Validation: OCR for scanned books has not yet been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor quality scans
  - Non-standard fonts or languages
Current Recommendations
For Large Context Models
- Gemini Models: With 1M+ token context windows, you can process very long documents without truncation
- Token Management: The current implementation supports up to 128K tokens by default, but the limit can be raised for larger context models
- Document Processing: Consider using higher token limits (e.g., 500K-1M) when working with:
  - Complete books or long reports
  - Multiple related documents
  - Comprehensive document analysis
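For context on what the token limit does, truncation to a token budget can be sketched as below. This is an illustration rather than the project's code: `truncate_to_token_limit` is a hypothetical function, and the tokenizer is passed in as two callables so the example runs without extra packages.

```python
from typing import Callable, List, Sequence, Tuple


def truncate_to_token_limit(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List],
    decode: Callable[[Sequence], str],
) -> Tuple[str, int, bool]:
    """Return (content, token_count, was_truncated) for a given token budget."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        # Within budget: return the text unchanged.
        return text, len(tokens), False
    # Over budget: keep only the first max_tokens tokens.
    return decode(tokens[:max_tokens]), max_tokens, True


# Whitespace "tokens" stand in for a real tokenizer in this demo.
content, count, truncated = truncate_to_token_limit(
    "one two three four five", 3, str.split, " ".join
)

# With tiktoken (listed in the project's dependencies) the callables could be:
#   enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
#   truncate_to_token_limit(text, 128_000, enc.encode, enc.decode)
```

Raising the limit for a larger context model then only changes the `max_tokens` value.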
Limitations to Consider
- OCR Reliability: Scanned document processing is experimental and may require manual validation
- Processing Time: Large documents and OCR operations can be time-intensive
- Memory Usage: High-resolution scanned documents may require significant system resources