Semantic Search MCP Server

A production-grade Model Context Protocol (MCP) server that puts a state-of-the-art MCP client (Claude Desktop/Code/Codex CLI) in the loop for both retrieval and ingestion. Agents plan with atomic MCP tools, deciding the extractor, chunker, reranker route, HyDE retries, graph hops, and metadata budget calls, while the server guarantees determinism, provenance, and thin-index security. Everything runs on local, open-source components, so embeddings and reranking stay zero-cost beyond your Claude subscription.

🌟 Features

  • Zero-Cost Embeddings & Reranking: Ollama-powered embeddings and a local TEI cross-encoder keep every document and query free of per-token fees.
  • Structure-Aware Ingestion: Docling processes entire PDFs in single calls for ~60-65% faster performance, extracting tables, figures, headings with bboxes, section breadcrumbs, and element IDs. Full-document processing eliminates per-page overhead while preserving all semantic structure.
  • Agent-Directed Hybrid Retrieval: Auto mode chooses among dense, hybrid, sparse, and rerank routes; when scores are low it returns an abstain so the MCP client can decide whether to run HyDE or try alternate queries. Every primitive stays exposed for manual overrides.
  • Multi-Collection Support (manual): Organize documents into separate knowledge bases by adding entries to NOMIC_KB_SCOPES; the default configuration ships with a single collection (daf_kb).
  • Incremental & Deterministic Ingestion: Smart update detection only reprocesses changed documents. Deterministic chunk IDs enable automatic upsert-based updates without manual cleanup.
  • Graph & Entities (lightweight): Ingestion extracts equipment/chemical/parameter entities and links them back to chunks so agents can pivot via kb.entities/kb.linkouts. (Full semantic relationship extraction is still on the roadmap.)
  • Operational Tooling: scripts/manage_cache.py and scripts/manage_collections.sh help purge caches or manage Qdrant collections; GPU knobs (DOCLING_DEVICE, DOCLING_BATCH_SIZE) keep heavy PDFs flowing.
  • Canary QA Framework: ingest.assess_quality can run user-supplied canary queries (config/canaries/) and report warnings before documents reach production.
  • Rich Provenance Tracking: Every chunk carries plan_hash, model_version, prompt_sha, and client_decisions for full auditability. Search results include page_numbers, section_path, element_ids, table_headers, table_units, and bboxes for precise citations and source verification. All metadata preserved in both Qdrant vector payloads and FTS database.
  • MCP-Controlled Upserts: ingest.upsert, ingest.upsert_batch, and ingest.corpus_upsert let agents push chunk artifacts straight into Qdrant + FTS without leaving the MCP workflow.
  • Client-Orchestrated Summaries & HyDE: LLM clients contribute section summaries via ingest.generate_summary and generate context-aware HyDE hypotheses locally before re-querying kb.dense, with every decision recorded in plan provenance.
  • Observability & Guardrails: Search logs include hashed subject IDs, stage-level timings, and top hits; eval.py runs gold sets with recall/nDCG/latency thresholds for CI gating.
  • MCP Integration: Works seamlessly with Claude Desktop, Claude Code, Codex CLI, and any MCP-compliant client.
  • Agent Playbooks: Ready-to-run "retrieve → assess → refine" workflows for Claude and other MCP clients are documented in the bundled agent prompts for Claude Desktop/Code and Codex CLI (see Further Reading).

Experimental / optional features such as SPLADE sparse expansion, ColBERT late interaction, automatic summaries/outlines, HyDE query expansion, and enforced canary QA require additional services or configuration. See the status table below for details.

Feature Status at a Glance

Feature | Status | Notes
Core dense / hybrid / sparse retrieval | ✅ Working | kb.search, kb.dense, kb.hybrid, kb.sparse, kb.batch
Entity extraction + graph link-outs | ✅ Working | kb.entities, kb.linkouts, kb.graph (entity→chunk relationships)
Table retrieval | ✅ Working (data-dependent) | Requires tables extracted during ingestion
Canary QA | ⚠️ Requires user config | Default config/canaries/*.json are placeholders; add queries to enforce gates
Document summaries / outlines | ⚠️ Client-provided | ingest.generate_summary stores semantic summaries with provenance; outlines still require building the heading index
HyDE retry | ⚠️ Client-generated | There is no server tool; draft the hypothesis in your MCP client, then re-query kb.dense / kb.search with it
MCP upsert pipeline | ✅ Working | ingest.upsert, ingest.upsert_batch, ingest.corpus_upsert
SPLADE sparse expansion | 💀 Planned | --sparse-expander hooks are present but no SPLADE model is bundled by default
ColBERT late interaction | 💀 Planned | Requires an external ColBERT service; disabled when COLBERT_URL is unset

✅ working today · ⚠️ requires extra configuration or build steps · 💀 planned / not yet implemented

🤖 MCP-First Architecture

Conventional RAG systems hide retrieval behind a monolithic API. This server embraces the MCP client as a planner:

  • Ingestion as a Toolchain – ingest.extract_with_strategy processes entire PDFs with Docling in single calls (no per-page routing); ingest.chunk_with_guidance switches between enumerated chunkers (heading_based, procedure_block, table_row); ingest.generate_metadata enforces byte budgets and prompt hashes. Every step returns artifacts and plan hashes for replayable ingestion.
  • Retrieval as Composable Primitives – kb.sparse, kb.dense, kb.hybrid, kb.rerank, kb.hint, kb.table_lookup, kb.entities, kb.linkouts, kb.batch, kb.quality, and kb.promote/demote. Use these with client-authored HyDE retries and planner heuristics to branch, retry, or fuse strategies mid-conversation.
  • Self-Critique with Insight – Results surface full score vectors (bm25, dense, rrf, rerank, prior, decay) and why annotations (matched aliases, headers, table clues), letting the agent reason about confidence before presenting an answer.

Because Claude (or any MCP client) stays in the driver seat, you get agentic retrieval and deterministic ingestion without surrendering provenance or security.
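
As a rough sketch of what this looks like from the client side, the loop below retries with a client-drafted HyDE hypothesis after an abstain. The call_tool and draft_hyde_hypothesis helpers and the abstain field are placeholders standing in for your MCP client's tool invocation and the server's actual response shape; they are not part of this repository.

# Hypothetical client-side planning loop (Python). call_tool stands in for your
# MCP client's tool invocation; response fields such as "abstain" are illustrative.
def call_tool(name: str, **arguments) -> dict:
    """Placeholder: forward a tool call to the MCP server and return its result."""
    raise NotImplementedError("wire this to your MCP client")

def draft_hyde_hypothesis(query: str) -> str:
    """Placeholder: the client LLM drafts a short hypothetical answer (HyDE)."""
    return f"A plausible passage answering: {query}"

def retrieve(query: str) -> dict:
    # Let the auto planner choose a route first.
    plan = call_tool("kb.search", query=query, mode="auto", top_k=8)
    if not plan.get("abstain"):
        return plan
    # Low-confidence abstain: the client decides to retry densely with a HyDE draft.
    hypothesis = draft_hyde_hypothesis(query)
    return call_tool("kb.dense", query=hypothesis, top_k=8)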

πŸ—οΈ Architecture

┌──────────────────────────────┐
│  MCP Clients                 │
│  Claude Desktop / Code / CLI │
└────────────┬─────────────────┘
             │ MCP Protocol (stdio)
             ↓
┌────────────────────────────────────────┐
│         server.py (FastMCP)            │
│  ┌──────────────────────────────────┐  │
│  │  Search Modes:                   │  │
│  │  • Semantic (vector only)        │  │
│  │  • Rerank (vector + reranker)    │  │
│  │  • Hybrid (RRF + rerank)         │  │
│  └──────────────────────────────────┘  │
└─────┬──────────────┬────────────┬──────┘
      │              │            │
      ↓              ↓            ↓
┌──────────┐  ┌──────────┐  ┌────────────┐
│  Qdrant  │  │ SQLite   │  │  Ollama    │
│  Vector  │  │  FTS5    │  │ Embeddings │
│   DB     │  │  (BM25)  │  └────────────┘
└──────────┘  └──────────┘
      │
      ↓
┌──────────────┐
│ TEI Reranker │
│ (Hugging     │
│  Face)       │
└──────────────┘

Processing Pipeline:

Documents → Extract (Docling full-document) → Chunk → Graph & Summaries → Embed → [Qdrant + SQLite FTS] → Planner → Search → Rerank → Self-Critique → Results

📋 Use Cases

  • Engineering Documentation: Search technical manuals, specifications, and handbooks (e.g., water treatment engineering, chemical engineering)
  • Legal Research: Query case law, contracts, and regulatory documents
  • Medical Literature: Search research papers, clinical guidelines, and medical textbooks
  • Academic Research: Build searchable libraries of papers and books
  • Corporate Knowledge Bases: Make internal documentation and reports searchable
  • Personal Research: Organize and query your personal document collection

🚀 Quick Start

Prerequisites

  • Docker Desktop (for Qdrant + TEI reranker)
  • Ollama (for embeddings)
  • Python 3.9+
  • Optional but recommended: set HF_HOME to a writable folder (e.g., export HF_HOME="$PWD/.cache/hf") so Docling can cache layout models when triage routes a page to structured extraction.

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/knowledge-base-mcp.git
cd knowledge-base-mcp
  2. Install Ollama (if not already installed):

    • Visit ollama.com and download for your platform
    • Pull the embedding model:
    ollama pull snowflake-arctic-embed:xs
    
  3. Start Docker services (Qdrant + TEI Reranker):

docker-compose up -d
  4. Set up Python environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
  5. Configure your MCP client:

Choose the configuration for your MCP client:

For Claude Code:

cp .mcp.json.example .mcp.json
# Edit .mcp.json and update the Python venv path

For Claude Desktop:

  • Copy claude_desktop_config.json.example contents to:
    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Update the paths to match your system (use wsl command on Windows)

For Codex CLI:

  • Use .codex/config.toml.example as a template
  6. Ingest your first documents:
.venv/bin/python3 ingest.py \
  --root /path/to/your/documents \
  --qdrant-collection my_docs \
  --max-chars 700 \
  --batch-size 128 \
  --parallel 1 \
  --ollama-threads 4 \
  --fts-db data/my_docs_fts.db \
  --fts-rebuild \
  --max-file-mb 100

Important CLI Parameters:

  • --max-file-mb 100: Maximum file size to process (default: 64MB). Increase for large handbooks/textbooks.
  • --fts-db: MUST match collection name (e.g., data/my_docs_fts.db for --qdrant-collection my_docs)
  • --max-chars 700: Recommended chunk size for reranker compatibility (old default: 1800)
  • --batch-size 128: Embedding batch size (new default: 128 vs old default: 32)
  • --fts-rebuild: Rebuild FTS database from scratch (omit for incremental updates)
  • Document timeouts have been removed - all documents process to completion regardless of size
  7. Test the search:
python validate_search.py \
  --query "your search query" \
  --collection my_docs \
  --mode hybrid
  8. Optionally run the evaluation harness (fails CI if thresholds are missed):
python eval.py \
  --gold eval/gold_sets/my_docs.jsonl \
  --mode auto \
  --fts-db data/my_docs_fts.db \
  --min-ndcg 0.85 --min-recall 0.80 --max-latency 3000
  9. Optional: add canary QA queries to config/canaries/ so ingest.assess_quality can enforce pass/fail gates instead of only reporting warnings.

See the detailed setup documentation in the repository for step-by-step instructions.

πŸ” Search Modes

Semantic Search (mode="semantic")

Pure vector similarity search using dense embeddings.

Best for: Conceptual queries, finding related content even without exact keyword matches

Example: "How do biological systems remove nitrogen?" will find relevant content even if it uses terms like "nitrification" or "denitrification"

Rerank Search (mode="rerank", default)

Vector retrieval followed by cross-encoder reranking.

Best for: Most use cases - good balance of speed and accuracy

Example: Standard searches where you want better precision than pure vector search

Hybrid Search (mode="hybrid")

Combines vector search + BM25 lexical search using RRF, then reranks.

Best for: Complex queries with both conceptual and specific keyword requirements

Example: "stainless steel 316L corrosion in chloride environments" benefits from both semantic understanding and exact term matching

Sparse Search (mode="sparse")

Runs alias-aware BM25 only, useful for short keyword queries or as a fallback when semantic routes miss exact terminology.

Auto Planner (mode="auto")

Default. Heuristics pick among semantic, hybrid, rerank, and sparse routes. When the top score falls below ANSWERABILITY_THRESHOLD, the server abstains and returns telemetry so the MCP client (Claude, etc.) can decide whether to generate a HyDE hypothesis and retry with kb.dense or kb.sparse.
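
A quick way to feel the difference between routes is to run one query through every mode with validate_search.py (flags as in the Quick Start; this assumes the script accepts all five mode names, and the query is just an example):

for mode in semantic rerank hybrid sparse auto; do
  python validate_search.py --query "316L corrosion in chloride service" --collection my_docs --mode "$mode"
done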

Additional MCP Tools

  • kb.collections – list configured collection slugs and their backing indices.
  • kb.open(collection="slug", chunk_id=...) – fetch a chunk by chunk_id or element_id, optionally slicing by char offsets.
  • kb.neighbors(collection="slug", chunk_id=...) – pull FTS neighbors around a chunk for more context.
  • kb.summary(collection="slug", topic=...) – retrieve lightweight section summaries (RAPTOR-style) built during ingest.
  • kb.graph(node_id=...) – inspect the lightweight graph (doc → section → chunk → entity) generated from structured metadata.
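
A typical follow-up after a search hit, written in the same tool-call notation as above (pseudo-calls, not runnable Python; the collection slug, chunk_id field, and query are placeholders):

hit = kb.search(query="saturator operating pressure", mode="auto", top_k=8)   # best hit from the planner
context = kb.neighbors(collection="daf_kb", chunk_id=hit.chunk_id, n=10)      # expand FTS neighbors around it
chunk = kb.open(collection="daf_kb", chunk_id=hit.chunk_id)                   # fetch the exact chunk text
graph = kb.graph(node_id=hit.chunk_id)                                        # pivot doc → section → chunk → entity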

📊 Search Parameters

Parameter | Default | Description
query | (required) | Search query text
mode | rerank | semantic, rerank, hybrid, sparse, or auto
top_k | 8 | Final results returned (1–100)
retrieve_k | 24 | Initial candidate pool (1–256)
return_k | 8 | Post-rerank results (≤ retrieve_k)
n (for kb.neighbors) | 10 (recommended) | Neighbor radius for context expansion; mandatory for comprehensive answers

Parameter Tuning Guide

Scenario | retrieve_k | return_k | top_k | mode
Quick search | 12 | 8 | 5 | rerank
Comprehensive | 48 | 16 | 10 | hybrid
High precision | 24 | 12 | 5 | hybrid
Exploratory | 32 | 12 | 8 | semantic
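
As a concrete reading of the table, the "Comprehensive" row maps onto a single kb.search call like the one below (tool-call notation; the query is a placeholder). retrieve_k widens the candidate pool, return_k caps what survives reranking, and top_k is what the client finally sees.

kb.search(query="design criteria for dissolved air flotation", mode="hybrid", retrieve_k=48, return_k=16, top_k=10)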

Configuration knobs

Set via environment variables (or CLI flags when available):

Environment variable | Purpose
MIX_W_BM25, MIX_W_DENSE, MIX_W_RERANK | Adjust blend between lexical, dense, and rerank signals.
HF_HOME | Hugging Face cache directory used by Docling models (default .cache/hf).
GRAPH_DB_PATH, SUMMARY_DB_PATH | Override lightweight graph and summary storage locations.
ANSWERABILITY_THRESHOLD | Minimum score required for auto mode to respond; lower scores return an abstain for the client to handle (e.g., run HyDE or rephrase).

Context retrieval (critical): always use kb.neighbors(n=10) after search; single chunks are insufficient at chunk size 700.

Note: Document timeouts have been removed. All documents process to completion regardless of size.
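
For example, to bias the blend toward lexical matches and make auto mode abstain earlier, you might export something like the following before starting the server (the values are illustrative, not recommendations):

export MIX_W_BM25=0.5
export MIX_W_DENSE=0.3
export MIX_W_RERANK=0.2
export ANSWERABILITY_THRESHOLD=0.35
export HF_HOME="$PWD/.cache/hf"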

📚 Usage

See the usage documentation in the repository for comprehensive coverage of:

  • Ingestion parameters and examples
  • Multi-collection setup
  • Advanced search features (neighbor expansion, time decay)
  • Incremental ingestion patterns
  • Performance tuning

For upcoming improvements, check the project roadmap.

📘 Further Reading

  • MCP agent prompt for Claude Desktop/Code, with ingestion and retrieval workflows.
  • MCP agent prompt for Codex CLI.
  • Deep dive into RRF, reranking, and chunking design choices.
  • Planned features, including SPLADE sparse expansion and ColBERT late interaction.

πŸ›οΈ Architecture Details

See the architecture documentation for a deep dive into:

  • Reciprocal Rank Fusion (RRF) algorithm
  • Cross-encoder reranking strategy
  • Neighbor context expansion
  • Embedding model selection rationale
  • Chunking strategy
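
For orientation, standard RRF sums reciprocal ranks across the candidate lists; the sketch below uses the HYBRID_RRF_K default of 60 and is illustrative only, not the exact code in server.py (which additionally applies the MIX_W_* blend and the rerank stage).

# Minimal Reciprocal Rank Fusion sketch (illustrative, not server.py's implementation).
def rrf_fuse(ranked_lists, k=60):
    """ranked_lists: lists of doc IDs, best-first. Returns (doc_id, score) pairs, best-first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: fuse a BM25 ranking with a dense-vector ranking.
fused = rrf_fuse([["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d", "doc_a"]])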

πŸ› οΈ Configuration

Environment Variables

All configuration can be customized via environment variables. See the project documentation for the full list.

Key variables:

  • OLLAMA_MODEL: Embedding model (default: snowflake-arctic-embed:xs)
  • QDRANT_URL: Qdrant server (default: http://localhost:6333)
  • TEI_RERANK_URL: Reranker endpoint (default: http://localhost:8087)
  • HYBRID_RRF_K: RRF parameter (default: 60)
  • NEIGHBOR_CHUNKS: Context expansion (default: 1)
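
In an MCP client config these are typically passed through the server entry's env block. A minimal illustration, assuming the standard mcpServers layout and the defaults above (the server name and paths are placeholders; mirror the bundled *.example files for your client):

{
  "mcpServers": {
    "knowledge-base": {
      "command": "/path/to/.venv/bin/python",
      "args": ["server.py"],
      "env": {
        "OLLAMA_MODEL": "snowflake-arctic-embed:xs",
        "QDRANT_URL": "http://localhost:6333",
        "TEI_RERANK_URL": "http://localhost:8087",
        "HYBRID_RRF_K": "60",
        "NEIGHBOR_CHUNKS": "1"
      }
    }
  }
}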

Multi-Collection Setup

Configure multiple knowledge bases with independent search tools:

{
  "NOMIC_KB_SCOPES": "{
    \"technical_docs\": {
      \"collection\": \"engineering_kb\",
      \"title\": \"Engineering Documentation\"
    },
    \"legal_docs\": {
      \"collection\": \"legal_kb\",
      \"title\": \"Legal Research\"
    }
  }"
}

This creates two MCP tools: search_technical_docs and search_legal_docs.
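
Assuming the per-collection tools accept the same parameters as kb.search, a call would look like this (the query is a placeholder):

search_technical_docs(query="pump cavitation limits", mode="hybrid", top_k=8)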

🔧 Troubleshooting

Services not starting?

# Check Docker services
docker-compose ps

# Check Ollama
curl http://localhost:11434/api/tags

# Check Qdrant
curl http://localhost:6333/collections

Embeddings failing?

  • Ensure Ollama model is pulled: ollama list
  • Check Ollama is running: ollama serve (usually auto-starts)
  • Try reducing batch size: Add --embed-batch-size 16 to ingest command
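
You can also confirm the model actually returns vectors by hitting Ollama's embeddings endpoint directly (model name as pulled in the Quick Start); a JSON response containing an embedding array means the model side is healthy:

curl http://localhost:11434/api/embeddings -d '{"model": "snowflake-arctic-embed:xs", "prompt": "test sentence"}'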

Search returning no results?

  • Verify collection name matches ingestion
  • Check Qdrant collection exists: curl http://localhost:6333/collections/{collection_name}
  • Confirm FTS database path is correct
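
To confirm the FTS side has content, you can inspect the SQLite database directly; listing tables avoids guessing at schema names:

sqlite3 data/my_docs_fts.db ".tables"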

See the troubleshooting documentation for more common issues.

📈 Performance

Hardware recommendations:

  • Minimum: 4GB RAM, 2 CPU cores
  • Recommended: 8GB RAM, 4 CPU cores
  • Optimal: 16GB RAM, 8+ CPU cores

Benchmarks (approximate, varies by hardware):

  • Ingestion: 5-10 pages/second (with Ollama embeddings)
  • Search latency: 100-300ms (hybrid mode with reranking)
  • Storage: ~2-3KB per chunk (vector + payload)

🤝 Contributing

Contributions welcome! Please see the contributing guidelines in the repository.

📄 License

MIT License - see the license file for details.

πŸ™ Acknowledgments

Built with:

📞 Support


Why local embeddings and reranking?

  • 💰 Zero Additional Cost: No per-document embedding fees, no per-query reranking charges - only your Claude subscription
  • 📈 Unlimited Scale: Ingest and search unlimited documents without incremental costs
  • ⚡ Fast: Local search with <300ms latency - no API roundtrips for embeddings or reranking
  • 🎯 Control: Full customization of embedding models, search parameters, and chunking strategy