lhstorm/mcp_server_knowledge_engine
The MCP Server Knowledge Engine is a robust server that converts PDF document collections into an intelligent, searchable knowledge base, accessible through Claude Desktop.
search_docs
Intelligent search through all documents with TF-IDF scoring and proximity matching.
list_docs
Lists all available documents with metadata and page counts.
get_document_content
Retrieves full document content, including specific pages with markdown formatting.
MCP Server Knowledge Engine
A powerful Model Context Protocol (MCP) server that transforms any PDF document collection into an intelligent, searchable knowledge base accessible through Claude Desktop. This server features advanced search capabilities using TF-IDF scoring, proximity matching, and domain-specific optimization.
Key Features
- Advanced Search Engine: TF-IDF-based inverted index with proximity matching for highly relevant results
- Universal PDF Support: Process any PDF collection - technical docs, legal papers, research, and more
- High Performance: Cached search index, incremental processing, and background initialization
- Domain Optimization: Configure domain-specific keywords for enhanced search accuracy
- Fully Configurable: JSON-based configuration with environment variable support
- Comprehensive CLI: Complete server management through intuitive commands
- Seamless MCP Integration: Ready-to-use with Claude Desktop, VS Code, and other MCP clients
- Smart Caching: MD5 hash-based change detection for efficient updates
Quick Start
Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
- Claude Desktop app (for MCP integration)
1. Installation
# Clone the repository
git clone https://github.com/lhstorm/mcp_server_knowledge_engine.git
cd mcp_server_knowledge_engine
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
2. Create Your Server
# Interactive setup
python manage_server.py create-config
# This will ask you for:
# - Server name (e.g., 'legal-docs-server')
# - Display name (e.g., 'Legal Documents Server')
# - PDF folder location
# - Domain-specific keywords
3. Add PDF Documents
# Add individual PDFs
python manage_server.py add-pdf /path/to/document.pdf
python manage_server.py add-pdf /path/to/another-doc.pdf
# Or copy PDFs directly to your configured folder
4. Process Documents
# Convert PDFs to searchable format
python manage_server.py process-pdfs
5. Generate MCP Configuration
# Generate configuration for Claude Desktop
python generate_mcp_config.py --merge
# Or get the config to copy manually
python generate_mcp_config.py
6. Start Using with Claude
Restart Claude Desktop and your server will appear in the MCP tools menu!
Using with Claude Desktop
Once configured, you can interact with your PDFs naturally:
Example prompts:
- "Search for information about [topic] in the documentation"
- "What does the documentation say about [specific feature]?"
- "Find all references to [keyword] across all PDFs"
- "Show me the content of [document name]"
- "List all available documents"
Advanced usage:
- "Search for [term1] near [term2]" - Leverages proximity matching
- "Get page 15 of [document]" - Retrieves specific pages
- "Find the top 10 results for [query]" - Adjusts result count
Project Structure
mcp_server_knowledge_engine/
├── server.py                  # Main MCP server with search engine
├── config.py                  # Configuration management & validation
├── manage_server.py           # CLI for server management
├── generate_mcp_config.py     # MCP configuration generator
├── convert_pdfs.py            # Standalone PDF conversion utility
├── server_config.json         # Active server configuration
├── requirements.txt           # Python dependencies
├── examples/                  # Example configurations
│   ├── legal_docs_config.json
│   ├── medical_docs_config.json
│   ├── research_papers_config.json
│   └── tech_docs_config.json
└── your-pdfs/                 # Your PDF folder (configurable)
    ├── document1.pdf
    ├── document2.pdf
    └── markdown/              # Auto-generated cache
        ├── .pdf_cache.json    # Processing metadata
        ├── .search_index.pkl  # Cached search index
        ├── document1.md       # Converted documents
        └── document2.md
Configuration
The server is configured via server_config.json:
{
  "server": {
    "name": "my-docs-server",
    "display_name": "My Documents Server",
    "description": "Search through my PDF collection",
    "version": "1.0.0"
  },
  "storage": {
    "pdf_folder": "./docs",
    "markdown_folder": "./docs/markdown",
    "domain_keywords": ["keyword1", "keyword2", "domain-term"]
  },
  "tools": {
    "search": {
      "name": "search_docs",
      "description": "Search through PDF documentation"
    },
    "list": {
      "name": "list_docs",
      "description": "List all available documents"
    },
    "content": {
      "name": "get_document_content",
      "description": "Get full content from documents"
    },
    "max_results_default": 5
  },
  "processing": {
    "cache_enabled": true,
    "parallel_processing": true,
    "max_file_size_mb": 50,
    "context_size": 500
  }
}
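To illustrate how a file of this shape could be read into typed objects, here is a minimal sketch. It is not the project's config.py: the class names, the parse_config and load_config_sketch helpers, and the EXAMPLE_CONFIG_PATH environment variable are all hypothetical, and only the "server" and "storage" sections are modelled.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerSection:
    name: str
    display_name: str = ""
    description: str = ""
    version: str = "1.0.0"


@dataclass
class StorageSection:
    pdf_folder: str = "./docs"
    markdown_folder: str = "./docs/markdown"
    domain_keywords: List[str] = field(default_factory=list)


@dataclass
class SketchConfig:
    server: ServerSection
    storage: StorageSection


def parse_config(raw):
    """Turn a decoded JSON dict into dataclasses, filling in defaults."""
    if "server" not in raw or "name" not in raw["server"]:
        raise ValueError("config needs a server.name entry")
    # Unknown keys inside a section raise TypeError, which surfaces typos early.
    return SketchConfig(
        server=ServerSection(**raw["server"]),
        storage=StorageSection(**raw.get("storage", {})),
    )


def load_config_sketch(default_path="server_config.json"):
    """Read the config file; the environment variable name is hypothetical."""
    path = os.environ.get("EXAMPLE_CONFIG_PATH", default_path)
    with open(path, encoding="utf-8") as fh:
        return parse_config(json.load(fh))
```

With this shape, a missing "storage" section falls back to the defaults, while a missing server name fails immediately with a clear error.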
Management Commands
Server Management
# Create new configuration
python manage_server.py create-config
# Test configuration
python manage_server.py test
# Generate MCP config
python manage_server.py generate-mcp-config
PDF Management
# List all PDFs
python manage_server.py list-pdfs
# Add PDF
python manage_server.py add-pdf document.pdf
# Remove PDF
python manage_server.py remove-pdf document.pdf
# Process all PDFs
python manage_server.py process-pdfs
MCP Configuration
# Print MCP config
python generate_mcp_config.py
# Automatically merge with Claude Desktop config
python generate_mcp_config.py --merge
# Save to file
python generate_mcp_config.py --output my_mcp_config.json
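For readers who merge the configuration by hand, the sketch below shows the general idea of adding one server entry under the "mcpServers" key that Claude Desktop reads. The exact output of generate_mcp_config.py is not shown in this README, so the build_entry and merge_entry helpers and the example paths are assumptions for illustration only.

```python
import json
import sys


def build_entry(server_name, server_script):
    """Build one MCP server entry that launches server_script with this Python."""
    return {server_name: {"command": sys.executable, "args": [server_script]}}


def merge_entry(existing, entry):
    """Return a copy of an existing client config with the entry added."""
    merged = dict(existing)
    servers = dict(merged.get("mcpServers", {}))
    servers.update(entry)  # an entry with the same name is overwritten
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Hypothetical existing config and script path.
    current = {"mcpServers": {"other-server": {"command": "node", "args": ["index.js"]}}}
    entry = build_entry("my-docs-server", "/absolute/path/to/server.py")
    print(json.dumps(merge_entry(current, entry), indent=2))
```

Using an absolute path to both the interpreter and server.py avoids depending on the working directory the client starts the server from.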
Usage Examples
Legal Documents Server
{
  "server": {
    "name": "legal-docs-server",
    "display_name": "Legal Documents Server"
  },
  "storage": {
    "domain_keywords": ["contract", "liability", "jurisdiction", "plaintiff", "defendant"]
  }
}
Technical Documentation Server
{
  "server": {
    "name": "tech-docs-server",
    "display_name": "Technical Documentation Server"
  },
  "storage": {
    "domain_keywords": ["API", "function", "class", "method", "parameter", "return"]
  }
}
Research Papers Server
{
  "server": {
    "name": "research-server",
    "display_name": "Research Papers Server"
  },
  "storage": {
    "domain_keywords": ["hypothesis", "methodology", "results", "conclusion", "analysis"]
  }
}
Available MCP Tools
Each server provides three configurable tools:
- Search Tool (default: search_docs)
  - Intelligent search through all documents
  - TF-IDF scoring with proximity matching
  - Returns relevant excerpts with context
- List Tool (default: list_docs)
  - Lists all available documents
  - Shows document metadata and page counts
- Content Tool (default: get_document_content)
  - Retrieves full document content
  - Can fetch specific pages
  - Includes complete markdown formatting
Domain Customization
The server adapts to your domain through:
- Domain Keywords: Configure terms important to your field
- Tool Names: Customize tool names (e.g., search_legal_docs)
- Descriptions: Tailor descriptions for your use case
- Context Size: Adjust how much context to return in search results
How the Search Engine Works
Inverted Index Architecture
The server uses an inverted index, so a query looks up only the entries for its own terms instead of scanning every document:
- Document Processing: PDFs are converted to markdown and tokenized
- Index Building: Words are mapped to their locations (document, page, position)
- TF-IDF Scoring:
  - TF (Term Frequency): How often a word appears in a document
  - IDF (Inverse Document Frequency): How rare a word is across all documents
  - The combined score ranks a document higher when it uses a query term often and that term is rare across the collection
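The steps above can be sketched in a few lines. This is an illustrative toy, not the SearchIndex class from server.py: the function names are made up, page numbers are omitted for brevity, and the smoothed IDF formula is one common choice rather than the one the project necessarily uses.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {document name: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    lengths = {}
    for name, text in docs.items():
        tokens = tokenize(text)
        lengths[name] = len(tokens)
        for pos, tok in enumerate(tokens):
            index[tok][name].append(pos)
    return index, lengths


def tfidf_search(query, index, lengths):
    """Score documents by summing TF * IDF over the query terms."""
    n_docs = len(lengths)
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term)
        if not postings:
            continue  # term does not occur anywhere
        # Smoothed IDF: rarer terms get a larger weight.
        idf = math.log((1 + n_docs) / (1 + len(postings))) + 1.0
        for name, positions in postings.items():
            tf = len(positions) / lengths[name]
            scores[name] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a.pdf": "contract liability contract",
        "b.pdf": "liability clause",
        "c.pdf": "unrelated text here",
    }
    index, lengths = build_index(docs)
    print(tfidf_search("contract liability", index, lengths))
```

Because positions are stored per document, the same structure can later support proximity checks without re-reading the text.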
Search Features
- Proximity Boosting: Multi-word queries score higher when terms appear close together
- Context Extraction: Returns relevant snippets with search terms highlighted
- Domain Keyword Recognition: Configured keywords get special treatment
- Page-Level Precision: Results include specific page numbers
- Smart Caching: Search index persists between server restarts
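Two of these features, proximity boosting and context extraction, can be illustrated as follows. The boost formula, the window of 10 tokens, and the ** highlight markers are assumptions chosen for the example; the README does not specify how server.py computes them.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_gap(positions_a, positions_b):
    """Smallest distance between two sorted position lists (two-pointer scan)."""
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_boost(text, term_a, term_b, window=10):
    """Return a multiplier above 1.0 when both terms occur within `window` tokens."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return 1.0
    gap = min_gap(pos_a, pos_b)
    if gap > window:
        return 1.0
    return 1.0 + (window - gap + 1) / window  # closer terms earn a larger boost


def snippet(text, term, context_size=40):
    """Return the first match of `term` with surrounding characters, highlighted."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return ""
    start = max(0, match.start() - context_size)
    end = min(len(text), match.end() + context_size)
    return text[start:match.start()] + "**" + match.group(0) + "**" + text[match.end():end]
```

In a full search, the multiplier would be applied to a document's TF-IDF score, and context_size would play the role of the "context_size" setting in server_config.json.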
Performance Optimizations
- Incremental Processing: MD5 hash-based change detection - only new/modified PDFs are processed
- Persistent Search Index: Pickled index loads instantly on server restart
- Background Initialization: Server accepts connections while building index
- Memory Efficiency: Streaming PDF processing and markdown storage
- Configurable Limits: Control file size limits and processing parameters
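Incremental processing of this kind can be sketched as below. The layout of .pdf_cache.json is not documented here, so this example assumes a flat {file name: MD5 hex digest} mapping, and the helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed_pdfs(pdf_folder, cache_path):
    """Return (names of new or modified PDFs, current hashes for all PDFs)."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    current = {}
    changed = []
    for pdf in sorted(Path(pdf_folder).glob("*.pdf")):
        current[pdf.name] = file_md5(pdf)
        if cache.get(pdf.name) != current[pdf.name]:
            changed.append(pdf.name)
    return changed, current


def save_cache(cache_path, hashes):
    """Persist the hashes so the next run can skip unchanged files."""
    Path(cache_path).write_text(json.dumps(hashes, indent=2))
```

A processing run would convert only the files in the changed list and then call save_cache, so a second run over an untouched folder has nothing to do.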
Troubleshooting
Common Issues & Solutions
Server not appearing in Claude Desktop:
- Ensure MCP configuration was merged: python generate_mcp_config.py --merge
- Check Python path: which python or where python (Windows)
- Verify server_config.json exists and is valid JSON
- Restart Claude Desktop after configuration changes
PDFs not processing:
- Check folder permissions: ls -la /path/to/pdf/folder
- Verify PDF files aren't corrupted: file document.pdf
- Look for errors in stderr: python server.py 2>error.log
- Ensure sufficient disk space for markdown cache
Search returns no/poor results:
- Initial indexing may take time - check stderr for progress
- Verify markdown files exist: ls markdown/*.md
- Check search index exists: ls markdown/.search_index.pkl
- Try single-word queries first, then expand
- Review domain keywords in configuration
Server crashes or hangs:
- Check Python version (3.8+ required): python --version
- Verify all dependencies installed: pip install -r requirements.txt
- Clear cache and reprocess: rm -rf markdown/.pdf_cache.json markdown/.search_index.pkl
- Check for file locking issues on Windows
Debug Mode
# Run with full debug output
python server.py 2>&1 | tee debug.log
# Check server initialization
grep "initialization" debug.log
# Monitor PDF processing
grep "Processing\|Error" debug.log
Validation Commands
# Test configuration validity
python manage_server.py test
# Verify configuration loading
python -c "from config import load_config_from_env_or_file; c=load_config_from_env_or_file(); print(f'Config loaded: {c.server.name}')"
# Check MCP integration
python generate_mcp_config.py # Should output valid JSON
Advanced Usage
Multiple Servers
You can run multiple specialized servers:
# Legal documents server
python manage_server.py --config legal_config.json create-config
# Technical docs server
python manage_server.py --config tech_config.json create-config
# Research papers server
python manage_server.py --config research_config.json create-config
Batch Processing
# Process multiple PDF folders
for folder in docs legal_docs tech_docs; do
  python convert_pdfs.py "$folder" "$folder/markdown"
done
Custom Keywords
Configure domain-specific keywords for better search relevance:
{
  "storage": {
    "domain_keywords": [
      "algorithm", "data structure", "complexity",
      "optimization", "performance", "scalability"
    ]
  }
}
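The README does not say how configured keywords influence ranking. One plausible approach, shown purely as an assumption, is to weight a query term's score contribution more heavily when the term is a domain keyword. The weighted_score helper and the 1.5 multiplier are hypothetical.

```python
def weighted_score(term_scores, domain_keywords, boost=1.5):
    """Sum per-term scores for one document, boosting terms that are domain keywords.

    term_scores maps each matched query term to its score contribution
    (for example a TF-IDF value) for a single document.
    """
    keywords = {k.lower() for k in domain_keywords}
    return sum(
        score * (boost if term.lower() in keywords else 1.0)
        for term, score in term_scores.items()
    )


if __name__ == "__main__":
    keywords = ["algorithm", "complexity"]
    doc_a = {"algorithm": 0.20}  # matches a domain keyword
    doc_b = {"fast": 0.25}       # matches only a generic term
    print(weighted_score(doc_a, keywords), weighted_score(doc_b, keywords))
```

Under this scheme a document matching a domain keyword can outrank one with a slightly higher raw score on a generic term.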
Architecture Overview
Core Components
- SearchIndex Class (server.py:27-140)
  - Implements inverted index with TF-IDF scoring
  - Handles word tokenization and document indexing
  - Provides proximity-based ranking for multi-word queries
- GenericPDFServer Class (server.py:142-661)
  - Main server implementation with MCP protocol handling
  - Manages PDF processing pipeline
  - Handles async operations and background initialization
- Configuration System (config.py)
  - Dataclass-based type-safe configuration
  - JSON schema validation
  - Environment variable support
- Management CLI (manage_server.py)
  - Interactive configuration creation
  - PDF management operations
  - Server testing and validation
Data Flow
PDFs → PDF Reader → Markdown Converter → Search Index → MCP Tools → Claude
  ↓                        ↓                   ↓
[.pdf files]        [.md cache files]   [.search_index.pkl]
Current Server Configuration
The repository currently includes a configuration for QuantConnect documentation (server_config.json). To create your own server:
# Option 1: Interactive setup
python manage_server.py create-config
# Option 2: Copy and modify an example
cp examples/tech_docs_config.json server_config.json
# Edit server_config.json with your settings
Example Use Cases
- Legal Firms: Search through contracts, case files, and legal documents
- Research Labs: Query scientific papers and technical reports
- Software Teams: Access API documentation and technical specs
- Medical Practices: Search patient records and medical literature
- Educational Institutions: Browse course materials and textbooks
Contributing
We welcome contributions! Here are some ways to help:
Enhancement Ideas
- Document Format Support: Add support for Word, HTML, or other formats
- Search Improvements: Implement semantic search, fuzzy matching, or ML-based ranking
- Performance: Add database backend, parallel processing, or distributed indexing
- Tools: Create specialized MCP tools for specific domains
- UI: Build a web interface for configuration management
Development Guidelines
- Follow existing code style and patterns
- Add tests for new functionality
- Update documentation for new features
- Submit PRs with clear descriptions
Security Considerations
- The server only reads PDFs from the configured folders; the only files it writes are the markdown cache and search index in the configured markdown folder
- No external network calls are made during operation
- Sensitive data remains local - nothing is sent to external services
- Configure appropriate file permissions for your PDF folders
License
This project is open source. See LICENSE file for details.
Acknowledgments
Built with the Model Context Protocol by Anthropic.
Ready to transform your PDFs into a searchable knowledge base?
Run python manage_server.py create-config to get started!
Dependencies
- mcp: Model Context Protocol SDK for building MCP servers
- PyPDF2: PDF parsing and text extraction
- asyncio: Asynchronous I/O for concurrent operations (part of the Python standard library, so no separate install is needed)
- jsonschema: JSON validation for configuration files
All dependencies are lightweight and have minimal system requirements.