MCP Document Indexer
A Python-based MCP (Model Context Protocol) server for local document indexing and search using LanceDB vector database and local LLMs.
Features
- Real-time Document Monitoring: Automatically indexes new and modified documents in configured folders
- Multi-format Support: Handles PDF, Word (docx/doc), text, Markdown, and RTF files
- Local LLM Integration: Uses Ollama for document summarization and keyword extraction; nothing ever leaves your computer
- Vector Search: Semantic search using LanceDB and sentence transformers (a query sketch follows this list)
- MCP Integration: Exposes search and catalog tools via Model Context Protocol
- Incremental Indexing: Only processes changed files to save resources
- Performance Optimized: Designed for decent performance on standard laptops (e.g. M1/M2 MacBook)
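To make the vector search feature concrete, here is a minimal sketch of a semantic query against the LanceDB index using the same sentence-transformers model. The table name "chunks" and the row fields are assumptions for illustration, not the indexer's actual schema:

import lancedb
from sentence_transformers import SentenceTransformer

# Embed the query with the same model used at index time
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("machine learning").tolist()

# Search the stored chunk vectors (table name "chunks" is hypothetical)
db = lancedb.connect("./vector_index")
table = db.open_table("chunks")
for row in table.search(query_vec).limit(10).to_list():
    print(row.get("file_path"), row.get("_distance"))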
Installation
Prerequisites
- Python 3.9+ installed
- uv package manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Ollama (for local LLM):
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (e.g., llama3.2)
ollama pull llama3.2:3b
Install MCP Document Indexer
# Clone the repository
git clone https://github.com/yairwein/mcp-doc-indexer.git
cd mcp-doc-indexer
# Install with uv
uv sync
# Or add it as a dependency to an existing project
uv add mcp-doc-indexer
Configuration
Configure the indexer using environment variables or a .env file:
# Folders to monitor (comma-separated)
WATCH_FOLDERS="/Users/me/Documents,/Users/me/Research"
# LanceDB storage path
LANCEDB_PATH="./vector_index"
# Ollama model for summarization
LLM_MODEL="llama3.2:3b"
# Text chunking settings
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# Embedding model (sentence-transformers)
EMBEDDING_MODEL="all-MiniLM-L6-v2"
# File types to index
FILE_EXTENSIONS=".pdf,.docx,.doc,.txt,.md,.rtf"
# Maximum file size in MB
MAX_FILE_SIZE_MB=100
# Ollama API URL
OLLAMA_BASE_URL="http://localhost:11434"
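These are ordinary environment variables, so a settings module can read them directly. A sketch assuming python-dotenv (the project's actual settings code may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # pick up a local .env file if present

WATCH_FOLDERS = [p.strip() for p in os.getenv("WATCH_FOLDERS", "").split(",") if p.strip()]
LANCEDB_PATH = os.getenv("LANCEDB_PATH", "./vector_index")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.2:3b")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
MAX_FILE_SIZE_MB = float(os.getenv("MAX_FILE_SIZE_MB", "100"))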
Usage
Run as Standalone Service
# Set environment variables
export WATCH_FOLDERS="/path/to/documents"
export LANCEDB_PATH="./my_index"
# Run the indexer
uv run python -m src.main
Integrate with Claude Desktop
Add to your Claude Desktop configuration (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"doc-indexer": {
"command": "uv",
"args": [
"run",
"--directory",
"/path/to/mcp-doc-indexer",
"python",
"-m",
"src.main"
],
"env": {
"WATCH_FOLDERS": "/Users/me/Documents,/Users/me/Research",
"LANCEDB_PATH": "/Users/me/.mcp-doc-index",
"LLM_MODEL": "llama3.2:3b"
}
}
}
}
MCP Tools
The indexer exposes the following tools via MCP:
search_documents
Search for documents using natural language queries.
- Parameters:
  - query: Search query text
  - limit: Maximum number of results (default: 10)
  - search_type: "documents" or "chunks"
get_catalog
List all indexed documents with summaries.
- Parameters:
  - skip: Number of documents to skip (default: 0)
  - limit: Maximum documents to return (default: 100)
get_document_info
Get detailed information about a specific document.
- Parameters:
  - file_path: Path to the document
reindex_document
Force reindexing of a specific document.
- Parameters:
  - file_path: Path to the document to reindex
get_indexing_stats
Get current indexing statistics.
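Outside Claude Desktop, any MCP client can call these tools. A sketch using the official mcp Python SDK over stdio (the directory path is a placeholder):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="uv",
    args=["run", "--directory", "/path/to/mcp-doc-indexer", "python", "-m", "src.main"],
    env={"WATCH_FOLDERS": "/path/to/documents"},
)

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search_documents",
                {"query": "machine learning", "limit": 5, "search_type": "chunks"},
            )
            print(result.content)

asyncio.run(main())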
Example Usage in Claude
Once configured, you can use the indexer in Claude:
"Search my documents for information about machine learning"
"Show me all PDFs I've indexed"
"What documents mention Python programming?"
"Get details about /Users/me/Documents/report.pdf"
"Reindex the latest version of my thesis"
Architecture
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ File Monitor │────▶│   Document   │────▶│  Local LLM   │
│  (Watchdog)  │     │    Parser    │     │   (Ollama)   │
└──────────────┘     └──────────────┘     └──────────────┘
                             │                    │
                             ▼                    ▼
                     ┌──────────────┐     ┌──────────────┐
                     │   LanceDB    │◀────│  Embeddings  │
                     │   Storage    │     │  (ST Model)  │
                     └──────────────┘     └──────────────┘
                             │
                             ▼
                     ┌──────────────┐
                     │   FastMCP    │
                     │    Server    │
                     └──────────────┘
                             │
                             ▼
                     ┌──────────────┐
                     │    Claude    │
                     │   Desktop    │
                     └──────────────┘
File Processing Pipeline
- File Detection: Watchdog monitors configured folders for changes
- Document Parsing: Extracts text from PDF, Word, and text files
- Text Chunking: Splits documents into overlapping chunks for better retrieval (a sketch follows this list)
- LLM Processing: Generates summaries and extracts keywords using Ollama
- Embedding Generation: Creates vector embeddings using sentence transformers
- Vector Storage: Stores documents and chunks in LanceDB
- MCP Exposure: Makes search and catalog tools available via MCP
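Step 3 is simple to picture. A character-based sketch of overlapping chunking with the default CHUNK_SIZE and CHUNK_OVERLAP values (the real parser may split on sentence or paragraph boundaries instead):

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so context survives chunk borders."""
    step = chunk_size - overlap  # 800 new characters per chunk by default
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

With the defaults, the last 200 characters of each chunk reappear at the start of the next, so a sentence cut by a boundary is still retrievable as a whole.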
Performance Considerations
- Incremental Indexing: Only changed files are reprocessed
- Async Processing: Parallel processing of multiple documents
- Batch Operations: Efficient batch indexing for multiple files
- Debouncing: Prevents duplicate processing of rapidly changing files (illustrated after this list)
- Size Limits: Configurable maximum file size to prevent memory issues
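Debouncing is worth illustrating, since editors often fire several filesystem events while saving a single file. A sketch of the idea with asyncio (the indexer's actual implementation may differ):

import asyncio

class Debouncer:
    """Run a callback only after a path has been quiet for `delay` seconds."""

    def __init__(self, delay: float = 2.0):
        self.delay = delay
        self._pending: dict[str, asyncio.Task] = {}

    def schedule(self, path: str, callback):
        # A new event for the same path cancels and restarts the timer.
        if (task := self._pending.get(path)) is not None:
            task.cancel()
        self._pending[path] = asyncio.create_task(self._fire(path, callback))

    async def _fire(self, path: str, callback):
        await asyncio.sleep(self.delay)
        self._pending.pop(path, None)
        await callback(path)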
Troubleshooting
Ollama Not Available
If Ollama is not running or the model isn't available, the indexer falls back to simple text extraction without summarization.
# Check Ollama status
ollama list
# Pull required model
ollama pull llama3.2:3b
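The same check can be made programmatically with one HTTP call to Ollama's /api/tags endpoint, which lists pulled models (httpx is shown here, but any HTTP client works):

import httpx

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server responds to /api/tags."""
    try:
        return httpx.get(f"{base_url}/api/tags", timeout=2.0).status_code == 200
    except httpx.HTTPError:
        return False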
Permission Issues
Ensure the indexer has read access to monitored folders:
chmod -R 755 /path/to/documents
Memory Usage
For large document collections, consider:
- Reducing CHUNK_SIZE to create smaller chunks
- Limiting MAX_FILE_SIZE_MB to skip very large files
- Using a smaller embedding model
Development
Running Tests
uv run pytest tests/
Code Formatting
uv run black src/
uv run ruff check src/
Building Package
uv build
License
MIT License - See LICENSE file for details
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
Support
For issues or questions:
- Open an issue on GitHub
- Check the troubleshooting section
- Review logs in the console output