mcp_server_knowledge_engine

lhstorm/mcp_server_knowledge_engine

The MCP Server Knowledge Engine is a robust server that converts PDF document collections into an intelligent, searchable knowledge base, accessible through Claude Desktop.

Tools
  1. search_docs

    Intelligent search through all documents with TF-IDF scoring and proximity matching.

  2. list_docs

    Lists all available documents with metadata and page counts.

  3. get_document_content

    Retrieves full document content, including specific pages with markdown formatting.

MCP Server Knowledge Engine

MCP Server Knowledge Engine is a Model Context Protocol (MCP) server that turns a collection of PDF documents into a searchable knowledge base accessible through Claude Desktop. Search combines TF-IDF scoring, proximity matching, and configurable domain keywords.

🌟 Key Features

  • 🔍 Advanced Search Engine: TF-IDF-based inverted index with proximity matching for highly relevant results
  • 📄 Universal PDF Support: Process any PDF collection - technical docs, legal papers, research, and more
  • ⚡ High Performance: Cached search index, incremental processing, and background initialization
  • 🎯 Domain Optimization: Configure domain-specific keywords for enhanced search accuracy
  • ⚙️ Fully Configurable: JSON-based configuration with environment variable support
  • 🛠️ Comprehensive CLI: Complete server management through intuitive commands
  • 🔗 Seamless MCP Integration: Ready-to-use with Claude Desktop, VS Code, and other MCP clients
  • 📊 Smart Caching: MD5 hash-based change detection for efficient updates

📋 Quick Start

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Claude Desktop app (for MCP integration)

1. Installation

# Clone the repository
git clone https://github.com/lhstorm/mcp_server_knowledge_engine.git
cd mcp_server_knowledge_engine

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Create Your Server

# Interactive setup
python manage_server.py create-config

# This will ask you for:
# - Server name (e.g., 'legal-docs-server')
# - Display name (e.g., 'Legal Documents Server')
# - PDF folder location
# - Domain-specific keywords

3. Add PDF Documents

# Add individual PDFs
python manage_server.py add-pdf /path/to/document.pdf
python manage_server.py add-pdf /path/to/another-doc.pdf

# Or copy PDFs directly to your configured folder

4. Process Documents

# Convert PDFs to searchable format
python manage_server.py process-pdfs

5. Generate MCP Configuration

# Generate configuration for Claude Desktop
python generate_mcp_config.py --merge

# Or get the config to copy manually
python generate_mcp_config.py

6. Start Using with Claude

Restart Claude Desktop and your server will appear in the MCP tools menu!

💬 Using with Claude Desktop

Once configured, you can interact with your PDFs naturally:

Example prompts:

  • "Search for information about [topic] in the documentation"
  • "What does the documentation say about [specific feature]?"
  • "Find all references to [keyword] across all PDFs"
  • "Show me the content of [document name]"
  • "List all available documents"

Advanced usage:

  • "Search for [term1] near [term2]" - Leverages proximity matching
  • "Get page 15 of [document]" - Retrieves specific pages
  • "Find the top 10 results for [query]" - Adjusts result count

📁 Project Structure

mcp_server_knowledge_engine/
├── server.py              # Main MCP server with search engine
├── config.py              # Configuration management & validation
├── manage_server.py       # CLI for server management
├── generate_mcp_config.py # MCP configuration generator
├── convert_pdfs.py        # Standalone PDF conversion utility
├── server_config.json     # Active server configuration
├── requirements.txt       # Python dependencies
├── examples/              # Example configurations
│   ├── legal_docs_config.json
│   ├── medical_docs_config.json
│   ├── research_papers_config.json
│   └── tech_docs_config.json
└── your-pdfs/             # Your PDF folder (configurable)
    ├── document1.pdf
    ├── document2.pdf
    └── markdown/          # Auto-generated cache
        ├── .pdf_cache.json      # Processing metadata
        ├── .search_index.pkl    # Cached search index
        ├── document1.md         # Converted documents
        └── document2.md

⚙️ Configuration

The server is configured via server_config.json:

{
  "server": {
    "name": "my-docs-server",
    "display_name": "My Documents Server", 
    "description": "Search through my PDF collection",
    "version": "1.0.0"
  },
  "storage": {
    "pdf_folder": "./docs",
    "markdown_folder": "./docs/markdown",
    "domain_keywords": ["keyword1", "keyword2", "domain-term"]
  },
  "tools": {
    "search": {
      "name": "search_docs",
      "description": "Search through PDF documentation"
    },
    "list": {
      "name": "list_docs", 
      "description": "List all available documents"
    },
    "content": {
      "name": "get_document_content",
      "description": "Get full content from documents"
    },
    "max_results_default": 5
  },
  "processing": {
    "cache_enabled": true,
    "parallel_processing": true,
    "max_file_size_mb": 50,
    "context_size": 500
  }
}
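
As an illustration of how a file with this shape could be read into typed objects, here is a minimal sketch. The names `StorageConfig`, `ProcessingConfig`, `parse_config`, and `load_config` are hypothetical and are not taken from config.py; only the JSON keys and example values come from the configuration above.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class StorageConfig:
    # Field names mirror the "storage" keys in the example above.
    pdf_folder: str
    markdown_folder: str
    domain_keywords: List[str] = field(default_factory=list)


@dataclass
class ProcessingConfig:
    # Defaults copy the example values; the project's real defaults may differ.
    cache_enabled: bool = True
    parallel_processing: bool = True
    max_file_size_mb: int = 50
    context_size: int = 500


def parse_config(raw: dict) -> tuple:
    """Return (server_name, StorageConfig, ProcessingConfig) from parsed JSON."""
    server_name = raw["server"]["name"]  # raises KeyError if the section is missing
    storage = StorageConfig(**raw["storage"])
    processing = ProcessingConfig(**raw.get("processing", {}))
    return server_name, storage, processing


def load_config(path: str = "server_config.json") -> tuple:
    """Read the JSON file from disk and parse it."""
    return parse_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

Unknown keys raise a `TypeError` in this sketch, which is one simple way to catch typos in a hand-edited file.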

🛠️ Management Commands

Server Management

# Create new configuration
python manage_server.py create-config

# Test configuration
python manage_server.py test

# Generate MCP config
python manage_server.py generate-mcp-config

PDF Management

# List all PDFs
python manage_server.py list-pdfs

# Add PDF
python manage_server.py add-pdf document.pdf

# Remove PDF  
python manage_server.py remove-pdf document.pdf

# Process all PDFs
python manage_server.py process-pdfs

MCP Configuration

# Print MCP config
python generate_mcp_config.py

# Automatically merge with Claude Desktop config
python generate_mcp_config.py --merge

# Save to file
python generate_mcp_config.py --output my_mcp_config.json
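
For orientation, the sketch below shows what merging one server entry into an MCP client configuration could look like. It assumes the `mcpServers` layout that Claude Desktop reads from claude_desktop_config.json; the exact output of generate_mcp_config.py is not shown in this README, so the function names and the `command`/`args` values are illustrative assumptions.

```python
import json
from pathlib import Path


def merge_server_entry(config, name, command, args, env=None):
    """Return a copy of `config` with one entry added or replaced under "mcpServers"."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    entry = {"command": command, "args": list(args)}
    if env:
        entry["env"] = dict(env)
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def merge_into_file(config_path, name, command, args, env=None):
    """Read the client config (if it exists), merge the entry, and write it back."""
    path = Path(config_path)
    current = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = merge_server_entry(current, name, command, args, env)
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
```

Other servers and unrelated top-level keys are left untouched, which is the behaviour one would want from a merge step.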

💡 Usage Examples

Legal Documents Server

{
  "server": {
    "name": "legal-docs-server",
    "display_name": "Legal Documents Server"
  },
  "storage": {
    "domain_keywords": ["contract", "liability", "jurisdiction", "plaintiff", "defendant"]
  }
}

Technical Documentation Server

{
  "server": {
    "name": "tech-docs-server", 
    "display_name": "Technical Documentation Server"
  },
  "storage": {
    "domain_keywords": ["API", "function", "class", "method", "parameter", "return"]
  }
}

Research Papers Server

{
  "server": {
    "name": "research-server",
    "display_name": "Research Papers Server"
  },
  "storage": {
    "domain_keywords": ["hypothesis", "methodology", "results", "conclusion", "analysis"]
  }
}

🔧 Available MCP Tools

Each server provides three configurable tools:

  1. Search Tool (default: search_docs)

    • Intelligent search through all documents
    • TF-IDF scoring with proximity matching
    • Returns relevant excerpts with context
  2. List Tool (default: list_docs)

    • Lists all available documents
    • Shows document metadata and page counts
  3. Content Tool (default: get_document_content)

    • Retrieves full document content
    • Can fetch specific pages
    • Includes complete markdown formatting

🎯 Domain Customization

The server adapts to your domain through:

  • Domain Keywords: Configure terms important to your field
  • Tool Names: Customize tool names (e.g., search_legal_docs)
  • Descriptions: Tailor descriptions for your use case
  • Context Size: Adjust how much context to return in search results
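
To make the context size setting concrete, here is a small sketch of snippet extraction. It assumes `context_size` is measured in characters and that the snippet is centred on the first match; the README does not state either, and `extract_context` is a hypothetical helper, not a function from server.py.

```python
def extract_context(text: str, term: str, context_size: int = 500) -> str:
    """Return roughly `context_size` characters centred on the first match of `term`."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return ""
    # Split the remaining budget evenly before and after the matched term.
    half = max(context_size - len(term), 0) // 2
    start = max(pos - half, 0)
    end = min(pos + len(term) + half, len(text))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```

A smaller value gives tighter excerpts and shorter tool responses; a larger one gives Claude more surrounding text per hit.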

🔍 How the Search Engine Works

Inverted Index Architecture

The server uses an inverted index, so a query looks up only the entries for its own terms instead of scanning every document:

  1. Document Processing: PDFs are converted to markdown and tokenized
  2. Index Building: Words are mapped to their locations (document, page, position)
  3. TF-IDF Scoring:
    • TF (Term Frequency): How often a word appears in a document
    • IDF (Inverse Document Frequency): How rare a word is across all documents
    • The combined score ranks a document higher when it uses a query term often and that term is rare across the collection
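
The steps above can be sketched in a few lines. This is a toy illustration, not the SearchIndex class from server.py: the class name `TinyIndex`, the tokenizer, and the `log(N / df) + 1` form of IDF are assumptions chosen for brevity, and page numbers are left out.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: word -> {doc_id: [token positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_lengths = {}

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        for position, word in enumerate(tokens):
            self.postings[word].setdefault(doc_id, []).append(position)

    def search(self, query):
        """Return [(doc_id, score)] sorted by descending TF-IDF score."""
        scores = defaultdict(float)
        total_docs = len(self.doc_lengths)
        for word in tokenize(query):
            docs = self.postings.get(word, {})
            if not docs:
                continue
            # Rarer words (present in fewer documents) get a larger weight.
            idf = math.log(total_docs / len(docs)) + 1.0
            for doc_id, positions in docs.items():
                tf = len(positions) / self.doc_lengths[doc_id]
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because positions are stored alongside each document, the same structure can later support proximity checks without re-reading the text.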

Search Features

  • Proximity Boosting: Multi-word queries score higher when terms appear close together
  • Context Extraction: Returns relevant snippets with search terms highlighted
  • Domain Keyword Recognition: Configured keywords get special treatment
  • Page-Level Precision: Results include specific page numbers
  • Smart Caching: Search index persists between server restarts
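
Proximity boosting can be pictured as a multiplier applied to a document's base score. The sketch below is one possible formulation, assuming token positions are available per query term; the function names, the 10-token window, and the linear boost are all hypothetical, and the README does not describe the formula the server actually uses.

```python
from itertools import product


def min_span(position_lists):
    """Smallest distance (in tokens) covering one position from every list."""
    # Brute force over all combinations; fine for a sketch, slow for long lists.
    return min(max(combo) - min(combo) for combo in product(*position_lists))


def proximity_boost(position_lists, window=10):
    """Return a multiplier >= 1.0 that grows as the query terms sit closer together."""
    if len(position_lists) < 2 or any(not positions for positions in position_lists):
        return 1.0  # single-term query, or a term missing from this document
    span = min_span(position_lists)
    if span >= window:
        return 1.0
    return 1.0 + (window - span) / window
```

For example, `proximity_boost([[3], [4]])` rewards two adjacent terms more than `proximity_boost([[3], [9]])`, and terms more than `window` tokens apart get no boost at all.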

📊 Performance Optimizations

  • Incremental Processing: MD5 hash-based change detection - only new/modified PDFs are processed
  • Persistent Search Index: The pickled index is loaded from disk on restart instead of being rebuilt
  • Background Initialization: Server accepts connections while building index
  • Memory Efficiency: Streaming PDF processing and markdown storage
  • Configurable Limits: Control file size limits and processing parameters
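
The incremental processing idea can be sketched with the standard library alone. The cache layout used here (a JSON object mapping file name to MD5 digest) is an assumption; the actual format of .pdf_cache.json is not documented in this README, and `file_md5` and `changed_pdfs` are hypothetical helper names.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are never fully loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_pdfs(pdf_folder, cache_file):
    """Return (names of new or modified PDFs, fresh name -> digest table)."""
    cache_path = Path(cache_file)
    old = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    new = {p.name: file_md5(p) for p in sorted(Path(pdf_folder).glob("*.pdf"))}
    changed = [name for name, digest in new.items() if old.get(name) != digest]
    return changed, new
```

After the changed files have been converted, the caller would write the fresh table back to the cache file so the next run skips them.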

🐛 Troubleshooting

Common Issues & Solutions

Server not appearing in Claude Desktop:

  • Ensure MCP configuration was merged: python generate_mcp_config.py --merge
  • Check Python path: which python or where python (Windows)
  • Verify server_config.json exists and is valid JSON
  • Restart Claude Desktop after configuration changes

PDFs not processing:

  • Check folder permissions: ls -la /path/to/pdf/folder
  • Verify PDF files aren't corrupted: file document.pdf
  • Look for errors in stderr: python server.py 2>error.log
  • Ensure sufficient disk space for markdown cache

Search returns no/poor results:

  • Initial indexing may take time - check stderr for progress
  • Verify markdown files exist: ls markdown/*.md
  • Check search index exists: ls markdown/.search_index.pkl
  • Try single-word queries first, then expand
  • Review domain keywords in configuration

Server crashes or hangs:

  • Check Python version (3.8+ required): python --version
  • Verify all dependencies installed: pip install -r requirements.txt
  • Clear cache and reprocess: rm -rf markdown/.pdf_cache.json markdown/.search_index.pkl
  • Check for file locking issues on Windows

Debug Mode

# Run with full debug output
python server.py 2>&1 | tee debug.log

# Check server initialization
grep "initialization" debug.log

# Monitor PDF processing
grep "Processing\|Error" debug.log

Validation Commands

# Test configuration validity
python manage_server.py test

# Verify configuration loading
python -c "from config import load_config_from_env_or_file; c=load_config_from_env_or_file(); print(f'✓ Config loaded: {c.server.name}')"

# Check MCP integration
python generate_mcp_config.py  # Should output valid JSON

🚀 Advanced Usage

Multiple Servers

You can run multiple specialized servers:

# Legal documents server
python manage_server.py --config legal_config.json create-config

# Technical docs server  
python manage_server.py --config tech_config.json create-config

# Research papers server
python manage_server.py --config research_config.json create-config

Batch Processing

# Process multiple PDF folders
for folder in docs legal_docs tech_docs; do
    python convert_pdfs.py "$folder" "$folder/markdown"
done

Custom Keywords

Configure domain-specific keywords for better search relevance:

{
  "storage": {
    "domain_keywords": [
      "algorithm", "data structure", "complexity",
      "optimization", "performance", "scalability"
    ]
  }
}
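
How configured keywords influence ranking is not spelled out in this README. Purely as an illustration, one simple approach would be to multiply a result's score when the query mentions a configured keyword. The function `keyword_boost` and the factor of 1.5 below are hypothetical and do not describe the server's actual behaviour.

```python
import re


def keyword_boost(query, domain_keywords, factor=1.5):
    """Return `factor` if the query contains any configured keyword as a whole word or phrase."""
    query_lower = query.lower()
    for keyword in domain_keywords:
        # Word boundaries stop "algorithm" from matching inside "algorithms".
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, query_lower):
            return factor
    return 1.0
```

Multi-word keywords such as "data structure" are matched as phrases in this sketch, which is why the check works on the raw query rather than on individual tokens.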

🏗️ Architecture Overview

Core Components

  1. SearchIndex Class (server.py:27-140)

    • Implements inverted index with TF-IDF scoring
    • Handles word tokenization and document indexing
    • Provides proximity-based ranking for multi-word queries
  2. GenericPDFServer Class (server.py:142-661)

    • Main server implementation with MCP protocol handling
    • Manages PDF processing pipeline
    • Handles async operations and background initialization
  3. Configuration System (config.py)

    • Dataclass-based type-safe configuration
    • JSON schema validation
    • Environment variable support
  4. Management CLI (manage_server.py)

    • Interactive configuration creation
    • PDF management operations
    • Server testing and validation
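
Background initialization, mentioned for the server class above, can be illustrated with plain asyncio. This is a sketch under stated assumptions rather than the GenericPDFServer code: `LazyIndexServer` and its methods are hypothetical names, and the index is reduced to a dictionary.

```python
import asyncio


class LazyIndexServer:
    """Starts immediately; searches wait until the index build has finished."""

    def __init__(self, build_index):
        self._build_index = build_index  # blocking callable that returns a dict
        self._future = None

    async def start(self):
        loop = asyncio.get_running_loop()
        # run_in_executor returns at once; the build continues in a worker thread,
        # so the event loop stays free to accept connections.
        self._future = loop.run_in_executor(None, self._build_index)

    def is_ready(self):
        return self._future is not None and self._future.done()

    async def search(self, word):
        index = await self._future  # resolves immediately once the build is done
        return index.get(word, [])
```

Requests that arrive before the index is ready simply wait on the same future, so no request is rejected during startup.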

Data Flow

PDFs → PDF Reader → Markdown Converter → Search Index → MCP Tools → Claude
         ↓                    ↓                ↓
    [.pdf files]      [.md cache files]  [.search_index.pkl]
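
The last stage of this flow keeps the index on disk as .search_index.pkl. A minimal sketch of saving and reloading such a file with the standard `pickle` module follows; `save_index` and `load_index` are hypothetical helpers, and the write-then-rename step is an added safety measure, not something the README says the server does.

```python
import pickle
from pathlib import Path


def save_index(index, path):
    """Write to a temporary file, then rename, so a crash cannot leave a half-written index."""
    target = Path(path)
    tmp = target.with_suffix(target.suffix + ".tmp")
    with open(tmp, "wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
    tmp.replace(target)


def load_index(path):
    """Return the cached index, or None if the file is missing or unreadable."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except (OSError, pickle.UnpicklingError, EOFError):
        return None  # caller falls back to rebuilding from the markdown files
```

Unpickling can execute arbitrary code, so an index file should only be loaded if it was written by the server itself.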

🔄 Current Server Configuration

The repository currently includes a configuration for QuantConnect documentation (server_config.json). To create your own server:

# Option 1: Interactive setup
python manage_server.py create-config

# Option 2: Copy and modify an example
cp examples/tech_docs_config.json server_config.json
# Edit server_config.json with your settings

📚 Example Use Cases

  • Legal Firms: Search through contracts, case files, and legal documents
  • Research Labs: Query scientific papers and technical reports
  • Software Teams: Access API documentation and technical specs
  • Medical Practices: Search patient records and medical literature
  • Educational Institutions: Browse course materials and textbooks

🤝 Contributing

We welcome contributions! Here are some ways to help:

Enhancement Ideas

  1. Document Format Support: Add support for Word, HTML, or other formats
  2. Search Improvements: Implement semantic search, fuzzy matching, or ML-based ranking
  3. Performance: Add database backend, parallel processing, or distributed indexing
  4. Tools: Create specialized MCP tools for specific domains
  5. UI: Build a web interface for configuration management

Development Guidelines

  • Follow existing code style and patterns
  • Add tests for new functionality
  • Update documentation for new features
  • Submit PRs with clear descriptions

🔐 Security Considerations

  • The server reads PDFs only from the configured PDF folder and writes its markdown cache and search index to the configured markdown folder
  • No external network calls are made during operation
  • Sensitive data remains local - nothing is sent to external services
  • Configure appropriate file permissions for your PDF folders

📄 License

This project is open source. See LICENSE file for details.

🙏 Acknowledgments

Built with the Model Context Protocol by Anthropic.


Ready to transform your PDFs into a searchable knowledge base?

Run python manage_server.py create-config to get started! 🚀

📦 Dependencies

  • mcp: Model Context Protocol SDK for building MCP servers
  • PyPDF2: PDF parsing and text extraction
  • asyncio: Asynchronous I/O for concurrent operations (part of the Python standard library, so no separate install is needed)
  • jsonschema: JSON validation for configuration files

All dependencies are lightweight and have minimal system requirements.