Local-Docs-MCP-Tool by Baronco - MCP Server

📚 Local Documents MCP Server

A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.

✨ Features

📁 Document Discovery: List all documents in a specified directory
⚡ Document Processing: Convert various document formats to markdown
🔍 OCR Support: Extract text from scanned PDFs using Tesseract OCR
🎯 Token Management: Automatic content truncation based on token limits
📄 Multi-format Support: Handle Word docs, PDFs, PowerPoint, Excel, and more

🛠️ Tools Available

list_documents: Find documents by path, name, and extension
load_documents: Extract document content as markdown
load_scanned_document: Extract text from scanned PDFs using OCR
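
As a rough illustration of what document discovery involves, the sketch below walks a directory with the standard library and reports path, name, and extension for each match. The extension set and the returned dictionary keys are assumptions made for this example, not the server's actual schema.

```python
from pathlib import Path

# Hypothetical extension set; the real server may recognise a different list.
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}


def list_documents(root: str, extensions: set[str] = DOCUMENT_EXTENSIONS) -> list[dict]:
    """Walk `root` recursively and describe every file with a matching extension."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        # Compare suffixes case-insensitively so REPORT.PDF is found too.
        if path.is_file() and path.suffix.lower() in extensions:
            results.append(
                {"path": str(path), "name": path.stem, "extension": path.suffix.lower()}
            )
    return results
```

Filtering by name could be layered on top of this by checking `path.stem` before appending.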

💻 System Requirements

Operating System: Windows 10/11
Python: 3.13 or higher
Package Manager: uv (recommended)

📋 Prerequisites Installation

1. 🐍 Python 3.13

Download and install Python 3.13 from python.org

2. ⚡ UV Package Manager

Install uv using pip:

pip install uv

3. 📖 Poppler for Windows

Purpose: Required for PDF processing and conversion to images for OCR.

Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/
Extract the ZIP file to:
```
D:\Program Files\poppler-24.08.0
```

The Poppler binaries should be located at:

D:\Program Files\poppler-24.08.0\Library\bin

Alternative locations: You can install Poppler in any directory, as long as you update the .env file with the correct path.

4. 👁️ Tesseract OCR

Purpose: Required for extracting text from scanned documents and images.

Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki
Install Tesseract following the installer instructions
Make sure Tesseract is added to your system PATH, or note the installation directory

🚀 Project Installation

1. 📥 Clone or Download the Project

git clone <your-repo-url>
cd LocalDocs

2. 📦 Install Python Dependencies

uv sync

This will install all required dependencies from pyproject.toml:

markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2 - Document conversion
mcp[cli]>=1.10.1 - MCP server framework
opencv-python>=4.11.0.86 - Image processing
pdf2image>=1.17.0 - PDF to image conversion
pytesseract>=0.3.13 - Tesseract OCR wrapper
python-dotenv>=1.1.1 - Environment variable management
tiktoken>=0.9.0 - Token counting

3. ⚙️ Configure Environment Variables

Create or update the .env file in the project root:

POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"

Note: Update the path to match your Poppler installation location.

🔧 Configuration for MCP Clients

🤖 Claude Desktop Configuration

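Add the server to your Claude Desktop configuration file (claude_desktop_config.json), as in the example under Example Configurations below. The server takes two positional arguments after server.py: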

First argument: Path to your documents directory
- Example: "C:\\Users\\YourUsername\\Documents\\MyDocuments"
- Use double backslashes for Windows paths in JSON
Second argument: Maximum tokens per document
- Example: "30000"
- Adjust based on your needs and Claude's token limits
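
For readers curious how two positional arguments like these are typically consumed, here is a minimal argparse sketch. It is an assumption about the general shape only; the real server.py may read its arguments differently, and the names `documents_dir` and `max_tokens` are illustrative.

```python
import argparse
from pathlib import Path


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the documents directory and the per-document token limit."""
    parser = argparse.ArgumentParser(description="Local documents MCP server (sketch)")
    parser.add_argument("documents_dir", type=Path, help="Directory containing the documents")
    parser.add_argument("max_tokens", type=int, help="Maximum tokens returned per document")
    args = parser.parse_args(argv)
    if args.max_tokens <= 0:
        parser.error("max_tokens must be a positive integer")
    return args


if __name__ == "__main__":
    args = parse_args([r"C:\Users\YourUsername\Documents\MyDocuments", "30000"])
    print(args.documents_dir, args.max_tokens)
```

Note that the token limit is passed as a string in the JSON config and converted to an integer on the Python side.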

📝 Example Configurations

Adjust both paths to match your project and document locations:

{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
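
Escaping backslashes by hand is easy to get wrong. One option is to let Python's json module produce the entry from ordinary Windows paths, as in this sketch; the helper name `build_config` is hypothetical, and the structure simply mirrors the example above.

```python
import json


def build_config(project_dir: str, documents_dir: str, max_tokens: int) -> str:
    """Render the mcpServers entry as JSON text with backslashes escaped."""
    config = {
        "mcpServers": {
            "local-documents": {
                "command": "uv",
                "args": [
                    "--directory",
                    project_dir,
                    "run",
                    "server.py",
                    documents_dir,
                    str(max_tokens),
                ],
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Raw strings keep the single backslashes; json.dumps doubles them in the output.
    print(
        build_config(
            r"C:\Users\YourUsername\Documents\LocalDocs",
            r"C:\Users\YourUsername\Documents\MyDocuments",
            30000,
        )
    )
```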

🎯 Usage

🚀 Starting the Server

Claude Desktop starts the server automatically at launch, using the configured settings.

🔄 Available Operations

📋 List Documents: Discover all documents in your configured directory
📄 Load Standard Documents: Process Word docs, PDFs, PowerPoint, Excel files
🔍 Load Scanned Documents: Use OCR to extract text from scanned PDFs

📊 Response Format

The server returns structured responses with:

Document paths and metadata
Token usage information
Processing time (for OCR operations)
Extracted content in markdown format

🛠️ Troubleshooting

⚠️ Common Issues

🔍 Poppler not found
- Verify Poppler installation path
- Check .env file configuration
- Ensure the path uses double backslashes on Windows
👁️ Tesseract not found
- Verify Tesseract installation
- Add Tesseract to system PATH
- Restart command prompt/PowerShell
🔐 Permission denied errors
- Ensure the document directory is accessible
- Check file permissions
- Run as administrator if necessary
❌ Import errors
- Verify all dependencies are installed: uv sync
- Check Python version: python --version
- Ensure you're using Python 3.13 or higher
⏳ Large document processing
- Reduce the token limit for better performance
- Consider splitting large documents
- Monitor memory usage during OCR operations

🐛 Debug Information

To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.

📁 File Structure

LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading

📄 Supported Document Formats

📊 Microsoft Office: .docx, .xlsx, .pptx
📖 PDF: Regular PDFs and scanned PDFs (via OCR)

⚡ Performance Considerations

🔍 OCR Processing: Scanned documents take significantly longer to process
🎯 Token Limits: Adjust based on your document sizes and Claude's context window
💾 Memory Usage: Large documents and OCR operations can be memory-intensive
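
Token-based truncation can be sketched as below. The tokenizer is passed in as a pair of functions so the example runs with the standard library alone; a whitespace splitter stands in for a real tokenizer here, which is an assumption for the demo and not how the server counts tokens.

```python
from typing import Callable, Sequence


def truncate_to_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> tuple[str, int, bool]:
    """Return (text, tokens_kept, was_truncated) with at most max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens), False
    return decode(tokens[:max_tokens]), max_tokens, True


# Stand-in tokenizer for the demo: one token per whitespace-separated word.
words_encode = str.split
words_decode = " ".join

print(truncate_to_tokens("alpha beta gamma delta", 3, words_encode, words_decode))
# prints ('alpha beta gamma', 3, True)
```

With tiktoken installed, the `encode` and `decode` methods of an encoding obtained from `tiktoken.get_encoding(...)` could be passed in instead; which encoding the project uses is not stated here.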

🤝 Contributing

When contributing to this project:

Ensure compatibility with Windows and Python 3.13
Test with various document formats
Verify OCR functionality with scanned documents
Update documentation for any new features

🗺️ Roadmap and Future Enhancements

🔮 Planned Features

🧠 Vector Storage and RAG Integration: Future versions will include vector-based document storage to:
- Reduce token consumption by avoiding repeated text extraction
- Enable semantic search across document collections
- Provide more efficient document retrieval and chunking
- Support persistent document indexing
🔍 Enhanced OCR Validation: Currently, OCR functionality for scanned books has not been fully validated and may encounter issues with:
- Complex layouts and formatting
- Multi-column documents
- Poor quality scans
- Non-standard fonts or languages

💡 Current Recommendations

🚀 For Large Context Models

🤖 Gemini Models: These models offer 1M+ token context windows, so very long documents can be processed without truncation
🎯 Token Management: The current implementation supports up to 128K tokens by default and can be adjusted for larger context models
📖 Document Processing: Consider using higher token limits (e.g., 500K-1M) when working with:
- Complete books or long reports
- Multiple related documents
- Comprehensive document analysis

⚠️ Limitations to Consider

🔍 OCR Reliability: Scanned document processing is experimental and may require manual validation
⏳ Processing Time: Large documents and OCR operations can be time-intensive
💾 Memory Usage: High-resolution scanned documents may require significant system resources