mcp_server_knowledge_engine by lhstorm - MCP Server

MCP Server Knowledge Engine

A powerful Model Context Protocol (MCP) server that transforms any PDF document collection into an intelligent, searchable knowledge base accessible through Claude Desktop. This server features advanced search capabilities using TF-IDF scoring, proximity matching, and domain-specific optimization.

🌟 Key Features

🔍 Advanced Search Engine: TF-IDF-based inverted index with proximity matching for highly relevant results
📄 Universal PDF Support: Process any PDF collection - technical docs, legal papers, research, and more
⚡ High Performance: Cached search index, incremental processing, and background initialization
🎯 Domain Optimization: Configure domain-specific keywords for enhanced search accuracy
⚙️ Fully Configurable: JSON-based configuration with environment variable support
🛠️ Comprehensive CLI: Complete server management through intuitive commands
🔗 Seamless MCP Integration: Ready-to-use with Claude Desktop, VS Code, and other MCP clients
📊 Smart Caching: MD5 hash-based change detection for efficient updates

📋 Quick Start

Prerequisites

Python 3.8 or higher
pip (Python package manager)
Claude Desktop app (for MCP integration)

1. Installation

# Clone the repository
git clone https://github.com/lhstorm/mcp_server_knowledge_engine.git
cd mcp_server_knowledge_engine

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Create Your Server

# Interactive setup
python manage_server.py create-config

# This will ask you for:
# - Server name (e.g., 'legal-docs-server')
# - Display name (e.g., 'Legal Documents Server')
# - PDF folder location
# - Domain-specific keywords

3. Add PDF Documents

# Add individual PDFs
python manage_server.py add-pdf /path/to/document.pdf
python manage_server.py add-pdf /path/to/another-doc.pdf

# Or copy PDFs directly to your configured folder

4. Process Documents

# Convert PDFs to searchable format
python manage_server.py process-pdfs

5. Generate MCP Configuration

# Generate configuration for Claude Desktop
python generate_mcp_config.py --merge

# Or get the config to copy manually
python generate_mcp_config.py
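
For orientation, Claude Desktop keeps its MCP servers under an `mcpServers` object in its JSON config file. The sketch below shows roughly what a merge step might do. The helper name `merge_mcp_entry` and the `command`/`args` values are illustrative assumptions, not the actual behavior or output of generate_mcp_config.py.

```python
import json
from pathlib import Path


def merge_mcp_entry(config_path, name, command, args):
    """Add or replace one server entry under "mcpServers", keeping other entries."""
    path = Path(config_path)
    # Start from the existing file if present, otherwise from an empty config.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    # Hypothetical entry shape: the interpreter to run plus the server script path.
    servers[name] = {"command": command, "args": list(args)}
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Merging rather than overwriting matters here because the same file may already list other MCP servers.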

6. Start Using with Claude

Restart Claude Desktop and your server will appear in the MCP tools menu!

💬 Using with Claude Desktop

Once configured, you can interact with your PDFs naturally:

Example prompts:

"Search for information about [topic] in the documentation"
"What does the documentation say about [specific feature]?"
"Find all references to [keyword] across all PDFs"
"Show me the content of [document name]"
"List all available documents"

Advanced usage:

"Search for [term1] near [term2]" - Leverages proximity matching
"Get page 15 of [document]" - Retrieves specific pages
"Find the top 10 results for [query]" - Adjusts result count

📁 Project Structure

mcp_server_knowledge_engine/
├── server.py              # Main MCP server with search engine
├── config.py              # Configuration management & validation
├── manage_server.py       # CLI for server management
├── generate_mcp_config.py # MCP configuration generator
├── convert_pdfs.py        # Standalone PDF conversion utility
├── server_config.json     # Active server configuration
├── requirements.txt       # Python dependencies
├── examples/              # Example configurations
│   ├── legal_docs_config.json
│   ├── medical_docs_config.json
│   ├── research_papers_config.json
│   └── tech_docs_config.json
└── your-pdfs/             # Your PDF folder (configurable)
    ├── document1.pdf
    ├── document2.pdf
    └── markdown/          # Auto-generated cache
        ├── .pdf_cache.json      # Processing metadata
        ├── .search_index.pkl    # Cached search index
        ├── document1.md         # Converted documents
        └── document2.md

⚙️ Configuration

The server is configured via server_config.json:

{
  "server": {
    "name": "my-docs-server",
    "display_name": "My Documents Server", 
    "description": "Search through my PDF collection",
    "version": "1.0.0"
  },
  "storage": {
    "pdf_folder": "./docs",
    "markdown_folder": "./docs/markdown",
    "domain_keywords": ["keyword1", "keyword2", "domain-term"]
  },
  "tools": {
    "search": {
      "name": "search_docs",
      "description": "Search through PDF documentation"
    },
    "list": {
      "name": "list_docs", 
      "description": "List all available documents"
    },
    "content": {
      "name": "get_document_content",
      "description": "Get full content from documents"
    },
    "max_results_default": 5
  },
  "processing": {
    "cache_enabled": true,
    "parallel_processing": true,
    "max_file_size_mb": 50,
    "context_size": 500
  }
}
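
As a rough illustration of how a file like this could be loaded into typed objects, here is a minimal sketch using dataclasses. The class names, the `parse_config` helper, and the default values (copied from the example above) are assumptions for illustration; the real loader lives in config.py and may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageConfig:
    pdf_folder: str
    markdown_folder: str
    domain_keywords: List[str] = field(default_factory=list)


@dataclass
class ProcessingConfig:
    # Defaults mirror the example file; they are assumed, not confirmed.
    cache_enabled: bool = True
    parallel_processing: bool = True
    max_file_size_mb: int = 50
    context_size: int = 500


def parse_config(text):
    """Parse JSON text and return (server_name, storage, processing)."""
    raw = json.loads(text)
    name = raw["server"]["name"]  # required; raises KeyError if missing
    storage = StorageConfig(**raw["storage"])
    processing = ProcessingConfig(**raw.get("processing", {}))
    if processing.max_file_size_mb <= 0:
        raise ValueError("max_file_size_mb must be positive")
    return name, storage, processing
```

Unknown keys in the `storage` or `processing` sections raise a `TypeError` in this sketch, which surfaces typos in the file early.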

🛠️ Management Commands

Server Management

# Create new configuration
python manage_server.py create-config

# Test configuration
python manage_server.py test

# Generate MCP config
python manage_server.py generate-mcp-config

PDF Management

# List all PDFs
python manage_server.py list-pdfs

# Add PDF
python manage_server.py add-pdf document.pdf

# Remove PDF  
python manage_server.py remove-pdf document.pdf

# Process all PDFs
python manage_server.py process-pdfs

MCP Configuration

# Print MCP config
python generate_mcp_config.py

# Automatically merge with Claude Desktop config
python generate_mcp_config.py --merge

# Save to file
python generate_mcp_config.py --output my_mcp_config.json

💡 Usage Examples

Legal Documents Server

{
  "server": {
    "name": "legal-docs-server",
    "display_name": "Legal Documents Server"
  },
  "storage": {
    "domain_keywords": ["contract", "liability", "jurisdiction", "plaintiff", "defendant"]
  }
}

Technical Documentation Server

{
  "server": {
    "name": "tech-docs-server", 
    "display_name": "Technical Documentation Server"
  },
  "storage": {
    "domain_keywords": ["API", "function", "class", "method", "parameter", "return"]
  }
}

Research Papers Server

{
  "server": {
    "name": "research-server",
    "display_name": "Research Papers Server"
  },
  "storage": {
    "domain_keywords": ["hypothesis", "methodology", "results", "conclusion", "analysis"]
  }
}

🔧 Available MCP Tools

Each server provides three configurable tools:

Search Tool (default: search_docs)
- Intelligent search through all documents
- TF-IDF scoring with proximity matching
- Returns relevant excerpts with context
List Tool (default: list_docs)
- Lists all available documents
- Shows document metadata and page counts
Content Tool (default: get_document_content)
- Retrieves full document content
- Can fetch specific pages
- Includes complete markdown formatting
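
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for the default search tool. The argument names `query` and `max_results` are assumptions for illustration; check the tool's input schema for the names this server actually expects.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP "tools/call" method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" and "max_results" are assumed argument names, not confirmed by the server.
request = build_tool_call(
    1, "search_docs", {"query": "indemnification clause", "max_results": 5}
)
print(json.dumps(request, indent=2))
```

In normal use Claude Desktop constructs these requests for you; the sketch is only meant to show what the tool names in the configuration map to.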

🎯 Domain Customization

The server adapts to your domain through:

Domain Keywords: Configure terms important to your field
Tool Names: Customize tool names (e.g., search_legal_docs)
Descriptions: Tailor descriptions for your use case
Context Size: Adjust how much context to return in search results
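
To make the effect of the context size concrete, here is a minimal sketch of snippet extraction. It assumes `context_size` is measured in characters and that the snippet is centred on the first match; the server's actual behavior may differ, and `extract_context` is a hypothetical helper name.

```python
def extract_context(text, term, context_size=500):
    """Return up to `context_size` characters centred on the first match of `term`."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return ""
    # Split the remaining budget evenly before and after the matched term.
    half = max(context_size - len(term), 0) // 2
    start = max(pos - half, 0)
    end = min(pos + len(term) + half, len(text))
    snippet = text[start:end]
    # Mark truncation so the reader can tell the snippet is an excerpt.
    return ("..." if start > 0 else "") + snippet + ("..." if end < len(text) else "")
```

A larger value gives Claude more surrounding text per result at the cost of longer tool responses.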

🔍 How the Search Engine Works

Inverted Index Architecture

The server uses an inverted index, so a query looks up its terms directly instead of scanning every document:

Document Processing: PDFs are converted to markdown and tokenized
Index Building: Words are mapped to their locations (document, page, position)
TF-IDF Scoring:
- TF (Term Frequency): How often a word appears in a document
- IDF (Inverse Document Frequency): How rare a word is across all documents
- Combined score ensures relevant, unique results rank higher
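
The steps above can be sketched in a few lines. This is a simplified illustration, not the `SearchIndex` implementation: it indexes by document and token position only (no page numbers), and the tokenizer, the length-normalized TF, and the smoothed IDF formula are all assumptions.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def tf_idf_search(index, docs, query):
    """Rank documents by the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    lengths = {d: len(tokenize(t)) for d, t in docs.items()}
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term)
        if not postings:
            continue
        # Rare terms get a higher IDF; the +1 keeps common terms from scoring zero.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, positions in postings.items():
            tf = len(positions) / lengths[doc_id]
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with the documents `{"a": "contract liability contract", "b": "liability clause"}`, a query for "contract liability" ranks "a" first because it matches both terms and "contract" is the rarer one.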

Search Features

Proximity Boosting: Multi-word queries score higher when terms appear close together
Context Extraction: Returns relevant snippets with search terms highlighted
Domain Keyword Recognition: Configured keywords get special treatment
Page-Level Precision: Results include specific page numbers
Smart Caching: Search index persists between server restarts
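
As an illustration of proximity boosting, the sketch below turns the token positions of each query term within one document into a score multiplier. The window size, the linear boost formula, and the function names are assumptions; the server's actual ranking may use a different rule.

```python
from itertools import product


def min_span(position_lists):
    """Smallest token distance covering one occurrence of every term."""
    best = None
    # Brute force over all combinations; fine for a sketch, slow for frequent terms.
    for combo in product(*position_lists):
        span = max(combo) - min(combo)
        if best is None or span < best:
            best = span
    return best


def proximity_boost(position_lists, window=10):
    """Return a multiplier between 1.0 and 2.0; closer terms give a larger boost."""
    if len(position_lists) < 2 or any(not p for p in position_lists):
        return 1.0
    span = min_span(position_lists)
    if span > window:
        return 1.0
    return 1.0 + (window - span) / window
```

With this formula, two terms at adjacent positions receive a multiplier of 1.9, while terms more than ten tokens apart receive no boost.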

📊 Performance Optimizations

Incremental Processing: MD5 hash-based change detection - only new/modified PDFs are processed
Persistent Search Index: The pickled index is loaded from disk on restart instead of being rebuilt
Background Initialization: Server accepts connections while building index
Memory Efficiency: Streaming PDF processing and markdown storage
Configurable Limits: Control file size limits and processing parameters
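
The incremental processing idea can be sketched as follows. The cache layout used here (a flat JSON object mapping file names to MD5 digests) is an assumption for illustration; the real `.pdf_cache.json` may store more metadata.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed_pdfs(pdf_folder, cache_file):
    """Return PDFs whose MD5 differs from the cached value, plus the updated cache."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    changed, updated = [], {}
    for pdf in sorted(Path(pdf_folder).glob("*.pdf")):
        digest = file_md5(pdf)
        updated[pdf.name] = digest
        if cache.get(pdf.name) != digest:
            changed.append(pdf.name)
    return changed, updated
```

Only the files returned in `changed` would need to be converted again; the caller writes `updated` back to the cache file once processing succeeds.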

🐛 Troubleshooting

Common Issues & Solutions

Server not appearing in Claude Desktop:

Ensure MCP configuration was merged: python generate_mcp_config.py --merge
Check Python path: which python or where python (Windows)
Verify server_config.json exists and is valid JSON
Restart Claude Desktop after configuration changes

PDFs not processing:

Check folder permissions: ls -la /path/to/pdf/folder
Verify PDF files aren't corrupted: file document.pdf
Look for errors in stderr: python server.py 2>error.log
Ensure sufficient disk space for markdown cache

Search returns no/poor results:

Initial indexing may take time - check stderr for progress
Verify markdown files exist: ls markdown/*.md
Check search index exists: ls markdown/.search_index.pkl
Try single-word queries first, then expand
Review domain keywords in configuration

Server crashes or hangs:

Check Python version (3.8+ required): python --version
Verify all dependencies installed: pip install -r requirements.txt
Clear cache and reprocess: rm -f markdown/.pdf_cache.json markdown/.search_index.pkl, then python manage_server.py process-pdfs
Check for file locking issues on Windows

Debug Mode

# Run with full debug output
python server.py 2>&1 | tee debug.log

# Check server initialization
grep "initialization" debug.log

# Monitor PDF processing
grep "Processing\|Error" debug.log

Validation Commands

# Test configuration validity
python manage_server.py test

# Verify configuration loading
python -c "from config import load_config_from_env_or_file; c=load_config_from_env_or_file(); print(f'✓ Config loaded: {c.server.name}')"

# Check MCP integration
python generate_mcp_config.py  # Should output valid JSON

🚀 Advanced Usage

Multiple Servers

You can run multiple specialized servers:

# Legal documents server
python manage_server.py --config legal_config.json create-config

# Technical docs server  
python manage_server.py --config tech_config.json create-config

# Research papers server
python manage_server.py --config research_config.json create-config

Batch Processing

# Process multiple PDF folders
for folder in docs legal_docs tech_docs; do
    python convert_pdfs.py "$folder" "$folder/markdown"
done

Custom Keywords

Configure domain-specific keywords for better search relevance:

{
  "storage": {
    "domain_keywords": [
      "algorithm", "data structure", "complexity",
      "optimization", "performance", "scalability"
    ]
  }
}
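
One plausible way configured keywords could influence ranking is by giving matching query terms a larger weight. The sketch below is an assumption about the general idea, not the server's actual logic; `term_weights` and the `boost` value are hypothetical.

```python
def term_weights(query, domain_keywords, boost=2.0):
    """Give query terms that are configured domain keywords a larger weight."""
    keywords = {k.lower() for k in domain_keywords}
    return {
        term: (boost if term in keywords else 1.0)
        for term in query.lower().split()
    }


weights = term_weights("algorithm for sorting", ["algorithm", "complexity"])
print(weights)  # {'algorithm': 2.0, 'for': 1.0, 'sorting': 1.0}
```

Note that this simple version matches single words only, so a multi-word keyword such as "data structure" would need phrase matching to be recognized.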

🏗️ Architecture Overview

Core Components

SearchIndex Class (server.py:27-140)
- Implements inverted index with TF-IDF scoring
- Handles word tokenization and document indexing
- Provides proximity-based ranking for multi-word queries
GenericPDFServer Class (server.py:142-661)
- Main server implementation with MCP protocol handling
- Manages PDF processing pipeline
- Handles async operations and background initialization
Configuration System (config.py)
- Dataclass-based type-safe configuration
- JSON schema validation
- Environment variable support
Management CLI (manage_server.py)
- Interactive configuration creation
- PDF management operations
- Server testing and validation
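
Background initialization can be illustrated with a small asyncio sketch: the index is built in a background task while requests are already being answered. The `LazyIndex` class, its placeholder index, and the "still building" message are assumptions for illustration and do not reflect `GenericPDFServer` internals.

```python
import asyncio


class LazyIndex:
    """Holds an index that becomes available once a background build finishes."""

    def __init__(self):
        self.ready = asyncio.Event()
        self.index = None

    async def build(self):
        await asyncio.sleep(0.01)  # stands in for PDF conversion and indexing
        self.index = {"example": ["doc1"]}
        self.ready.set()

    async def search(self, term):
        if not self.ready.is_set():
            return "Index is still being built, try again shortly."
        return self.index.get(term, [])


async def main():
    lazy = LazyIndex()
    task = asyncio.create_task(lazy.build())  # scheduled to run in the background
    early = await lazy.search("example")      # answered before the build finishes
    await task
    late = await lazy.search("example")
    return early, late
```

Running `asyncio.run(main())` returns the "still being built" message for the early query and the matching documents for the later one.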

Data Flow

PDFs → PDF Reader → Markdown Converter → Search Index → MCP Tools → Claude
         ↓                    ↓                ↓
    [.pdf files]      [.md cache files]  [.search_index.pkl]
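
The conversion step in this flow can be sketched as a function that joins per-page text into one markdown document. The heading layout is an assumption about the cache format. With PyPDF2, the `pages` list could be obtained from `PdfReader(path).pages` by calling `extract_text()` on each page.

```python
def pages_to_markdown(title, pages):
    """Join per-page text into one markdown document with a heading per page."""
    parts = [f"# {title}", ""]
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}")
        parts.append("")
        parts.append(text.strip())
        parts.append("")
    return "\n".join(parts)
```

Keeping an explicit marker per page is one way to preserve the page numbers that search results and page-level retrieval rely on.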

🔄 Current Server Configuration

The repository currently includes a configuration for QuantConnect documentation (server_config.json). To create your own server:

# Option 1: Interactive setup
python manage_server.py create-config

# Option 2: Copy and modify an example
cp examples/tech_docs_config.json server_config.json
# Edit server_config.json with your settings

📚 Example Use Cases

Legal Firms: Search through contracts, case files, and legal documents
Research Labs: Query scientific papers and technical reports
Software Teams: Access API documentation and technical specs
Medical Practices: Search patient records and medical literature
Educational Institutions: Browse course materials and textbooks

🤝 Contributing

We welcome contributions! Here are some ways to help:

Enhancement Ideas

Document Format Support: Add support for Word, HTML, or other formats
Search Improvements: Implement semantic search, fuzzy matching, or ML-based ranking
Performance: Add database backend, parallel processing, or distributed indexing
Tools: Create specialized MCP tools for specific domains
UI: Build a web interface for configuration management

Development Guidelines

Follow existing code style and patterns
Add tests for new functionality
Update documentation for new features
Submit PRs with clear descriptions

🔐 Security Considerations

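The server reads PDFs only from the configured PDF folder and writes converted markdown and the search index to the configured markdown folder
The server itself makes no external network calls during operation
Documents are processed and indexed locally; excerpts returned by the tools are passed to the MCP client (for example Claude Desktop) that requested them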
Configure appropriate file permissions for your PDF folders

📄 License

This project is open source. See LICENSE file for details.

🙏 Acknowledgments

Built with the Model Context Protocol by Anthropic.

Ready to transform your PDFs into a searchable knowledge base?

Run python manage_server.py create-config to get started! 🚀

📦 Dependencies

mcp: Model Context Protocol SDK for building MCP servers
PyPDF2: PDF parsing and text extraction
asyncio: Asynchronous I/O for concurrent operations (part of the Python standard library, no installation needed)
jsonschema: JSON validation for configuration files

All dependencies are lightweight and have minimal system requirements.