puran-water/knowledge-base-mcp
Semantic Search MCP Server
A production-grade Model Context Protocol (MCP) server that puts a state-of-the-art MCP client (Claude Desktop/Code/Codex CLI) in the loop for both retrieval and ingestion. Agents plan with atomic MCP tools (deciding the extractor, chunker, reranker route, HyDE retries, graph hops, and metadata budget calls) while the server guarantees determinism, provenance, and thin-index security. Everything runs on local, open-source components, so embeddings and reranking stay zero-cost beyond your Claude subscription.
Features
- Zero-Cost Embeddings & Reranking: Ollama-powered embeddings and a local TEI cross-encoder keep every document and query free of per-token fees.
- Structure-Aware Ingestion: Docling processes entire PDFs in single calls for ~60-65% faster performance, extracting tables, figures, headings with bboxes, section breadcrumbs, and element IDs. Full-document processing eliminates per-page overhead while preserving all semantic structure.
- Agent-Directed Hybrid Retrieval: Auto mode chooses among dense, hybrid, sparse, and rerank routes; when scores are low it returns an abstain so the MCP client can decide whether to run HyDE or try alternate queries. Every primitive stays exposed for manual overrides.
- Multi-Collection Support (manual): Organize documents into separate knowledge bases by adding entries to NOMIC_KB_SCOPES; the default configuration ships with a single collection (daf_kb).
- Incremental & Deterministic Ingestion: Smart update detection only reprocesses changed documents. Deterministic chunk IDs enable automatic upsert-based updates without manual cleanup.
- Graph & Entities (lightweight): Ingestion extracts equipment/chemical/parameter entities and links them back to chunks so agents can pivot via kb.entities / kb.linkouts. (Full semantic relationship extraction is still on the roadmap.)
- Operational Tooling: scripts/manage_cache.py and scripts/manage_collections.sh help purge caches or manage Qdrant collections; GPU knobs (DOCLING_DEVICE, DOCLING_BATCH_SIZE) keep heavy PDFs flowing.
- Canary QA Framework: ingest.assess_quality can run user-supplied canary queries (config/canaries/) and report warnings before documents reach production.
- Rich Provenance Tracking: Every chunk carries plan_hash, model_version, prompt_sha, and client_decisions for full auditability. Search results include page_numbers, section_path, element_ids, table_headers, table_units, and bboxes for precise citations and source verification. All metadata is preserved in both Qdrant vector payloads and the FTS database.
- MCP-Controlled Upserts: ingest.upsert, ingest.upsert_batch, and ingest.corpus_upsert let agents push chunk artifacts straight into Qdrant + FTS without leaving the MCP workflow.
- Client-Orchestrated Summaries & HyDE: LLM clients contribute section summaries via ingest.generate_summary and generate context-aware HyDE hypotheses locally before re-querying kb.dense, with every decision recorded in plan provenance.
- Observability & Guardrails: Search logs include hashed subject IDs, stage-level timings, and top hits; eval.py runs gold sets with recall/nDCG/latency thresholds for CI gating.
- MCP Integration: Works seamlessly with Claude Desktop, Claude Code, Codex CLI, and any MCP-compliant client.
- Agent Playbooks: Ready-to-run "retrieve → assess → refine" workflows for Claude and other MCP clients are documented in the agent prompts listed under Further Reading.
Experimental / optional features such as SPLADE sparse expansion, ColBERT late interaction, automatic summaries/outlines, HyDE query expansion, and enforced canary QA require additional services or configuration. See the status table below for details.
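The "Incremental & Deterministic Ingestion" feature above hinges on chunk IDs that stay stable across re-runs. The sketch below shows one way such an ID could be derived; the exact fields hashed by the real ingest pipeline may differ, so treat this as an illustration of the idea rather than the project's actual scheme.

```python
import hashlib
import json

def deterministic_chunk_id(doc_path: str, section_path: str, chunk_index: int, text: str) -> str:
    """Stable ID: re-ingesting an unchanged document yields identical IDs,
    so upserts overwrite old chunks instead of duplicating them.
    The field choices here are assumptions, not the project's exact scheme."""
    key = json.dumps(
        {
            "doc": doc_path,
            "section": section_path,
            "idx": chunk_index,
            "text_sha": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]

if __name__ == "__main__":
    a = deterministic_chunk_id("handbook.pdf", "3.2 Aeration", 0, "Dissolved oxygen setpoints ...")
    b = deterministic_chunk_id("handbook.pdf", "3.2 Aeration", 0, "Dissolved oxygen setpoints ...")
    assert a == b  # identical inputs always map to the same chunk ID
```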
Feature Status at a Glance
| Feature | Status | Notes |
|---|---|---|
| Core dense / hybrid / sparse retrieval | Working | kb.search, kb.dense, kb.hybrid, kb.sparse, kb.batch |
| Entity extraction + graph link-outs | Working | kb.entities, kb.linkouts, kb.graph (entity→chunk relationships) |
| Table retrieval | Working (data-dependent) | Requires tables extracted during ingestion |
| Canary QA | Requires user config | Default config/canaries/*.json are placeholders; add queries to enforce gates |
| Document summaries / outlines | Client-provided | ingest.generate_summary stores semantic summaries with provenance; outlines still require building the heading index |
| HyDE retry | Client-generated | There is no server tool; draft the hypothesis in your MCP client, then re-query kb.dense / kb.search with it |
| MCP upsert pipeline | Working | ingest.upsert, ingest.upsert_batch, ingest.corpus_upsert |
| SPLADE sparse expansion | Planned | --sparse-expander hooks are present but no SPLADE model is bundled by default |
| ColBERT late interaction | Planned | Requires an external ColBERT service; disabled when COLBERT_URL is unset |
Status legend: "Working" is available today; "Requires user config", "Client-provided", and "Client-generated" need extra configuration or client-side steps; "Planned" features are not yet implemented.
MCP-First Architecture
Conventional RAG systems hide retrieval behind a monolithic API. This server embraces the MCP client as a planner:
- Ingestion as a Toolchain: ingest.extract_with_strategy processes entire PDFs with Docling in single calls (no per-page routing); ingest.chunk_with_guidance switches between enumerated chunkers (heading_based, procedure_block, table_row); ingest.generate_metadata enforces byte budgets and prompt hashes. Every step returns artifacts and plan hashes for replayable ingestion.
- Retrieval as Composable Primitives: kb.sparse, kb.dense, kb.hybrid, kb.rerank, kb.hint, kb.table_lookup, kb.entities, kb.linkouts, kb.batch, kb.quality, and kb.promote/demote. Use these with client-authored HyDE retries and planner heuristics to branch, retry, or fuse strategies mid-conversation.
- Self-Critique with Insight: Results surface full score vectors (bm25, dense, rrf, rerank, prior, decay) and why annotations (matched aliases, headers, table clues), letting the agent reason about confidence before presenting an answer.
Because Claude (or any MCP client) stays in the driver's seat, you get agentic retrieval and deterministic ingestion without surrendering provenance or security.
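A minimal sketch of what the ingestion toolchain looks like from the client side. Here `call_tool` stands in for whatever MCP client SDK you use, and the argument names and returned `artifact_id` field are assumptions, not the server's literal signatures; only the tool names and chunker names come from the documentation above.

```python
# Sketch of an MCP client driving the ingestion toolchain end to end.
# `call_tool` is a stand-in for your MCP client SDK's tool-invocation method;
# argument names and the "artifact_id" field are illustrative assumptions.

def ingest_document(call_tool, pdf_path: str, collection: str = "my_docs"):
    # 1. Extract the whole PDF with Docling in a single call (no per-page routing).
    extraction = call_tool("ingest.extract_with_strategy",
                           {"path": pdf_path, "strategy": "docling"})

    # 2. Chunk with an enumerated chunker; heading_based suits manuals with deep TOCs.
    chunks = call_tool("ingest.chunk_with_guidance",
                       {"artifact_id": extraction["artifact_id"],
                        "chunker": "heading_based", "max_chars": 700})

    # 3. Generate metadata under a byte budget; plan hashes make the run replayable.
    metadata = call_tool("ingest.generate_metadata",
                         {"artifact_id": chunks["artifact_id"], "byte_budget": 2048})

    # 4. Push the chunk artifacts into Qdrant + FTS through the MCP upsert pipeline.
    return call_tool("ingest.upsert_batch",
                     {"collection": collection, "artifact_id": metadata["artifact_id"]})
```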
Architecture
┌───────────────────────────────┐
│          MCP Clients          │
│  Claude Desktop / Code / CLI  │
└───────────────┬───────────────┘
                │  MCP Protocol (stdio)
                ▼
┌──────────────────────────────────────────┐
│            server.py (FastMCP)           │
│   ┌────────────────────────────────┐     │
│   │ Search Modes:                  │     │
│   │  • Semantic (vector only)      │     │
│   │  • Rerank (vector + reranker)  │     │
│   │  • Hybrid (RRF + rerank)       │     │
│   └────────────────────────────────┘     │
└───────┬──────────────┬────────────┬──────┘
        │              │            │
        ▼              ▼            ▼
┌──────────┐    ┌──────────┐   ┌────────────┐
│  Qdrant  │    │  SQLite  │   │   Ollama   │
│  Vector  │    │   FTS5   │   │ Embeddings │
│    DB    │    │  (BM25)  │   └────────────┘
└──────────┘    └──────────┘
        │
        ▼
┌──────────────┐
│ TEI Reranker │
│   (Hugging   │
│    Face)     │
└──────────────┘
Processing Pipeline:
Documents → Extract (Docling full-document) → Chunk → Graph & Summaries → Embed → [Qdrant + SQLite FTS] → Planner → Search → Rerank → Self-Critique → Results
Use Cases
- Engineering Documentation: Search technical manuals, specifications, and handbooks (e.g., water treatment engineering, chemical engineering)
- Legal Research: Query case law, contracts, and regulatory documents
- Medical Literature: Search research papers, clinical guidelines, and medical textbooks
- Academic Research: Build searchable libraries of papers and books
- Corporate Knowledge Bases: Make internal documentation and reports searchable
- Personal Research: Organize and query your personal document collection
Quick Start
Prerequisites
- Docker Desktop (for Qdrant + TEI reranker)
- Ollama (for embeddings)
- Python 3.9+
- Optional but recommended: set HF_HOME to a writable folder (e.g., export HF_HOME="$PWD/.cache/hf") so Docling can cache layout models when triage routes a page to structured extraction.
Installation
- Clone the repository:
git clone https://github.com/yourusername/knowledge-base-mcp.git
cd knowledge-base-mcp
- Install Ollama (if not already installed):
  - Visit ollama.com and download for your platform
  - Pull the embedding model: ollama pull snowflake-arctic-embed:xs
- Start Docker services (Qdrant + TEI Reranker):
docker-compose up -d
- Set up Python environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
- Configure your MCP client:
Choose the configuration for your MCP client:
For Claude Code:
cp .mcp.json.example .mcp.json
# Edit .mcp.json and update the Python venv path
For Claude Desktop:
- Copy claude_desktop_config.json.example contents to:
  - macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  - Windows: %APPDATA%\Claude\claude_desktop_config.json
- Update the paths to match your system (use the wsl command on Windows)
For Codex CLI:
- Use .codex/config.toml.example as a template
- Ingest your first documents:
.venv/bin/python3 ingest.py \
--root /path/to/your/documents \
--qdrant-collection my_docs \
--max-chars 700 \
--batch-size 128 \
--parallel 1 \
--ollama-threads 4 \
--fts-db data/my_docs_fts.db \
--fts-rebuild \
--max-file-mb 100
Important CLI Parameters:
- --max-file-mb 100: Maximum file size to process (default: 64MB). Increase for large handbooks/textbooks.
- --fts-db: MUST match the collection name (e.g., data/my_docs_fts.db for --qdrant-collection my_docs)
- --max-chars 700: Recommended chunk size for reranker compatibility (old default: 1800)
- --batch-size 128: Embedding batch size (new default: 128 vs old default: 32)
- --fts-rebuild: Rebuild FTS database from scratch (omit for incremental updates)
- Document timeouts have been removed; all documents process to completion regardless of size
- Test the search:
python validate_search.py \
--query "your search query" \
--collection my_docs \
--mode hybrid
- Optionally run the evaluation harness (fails CI if thresholds are missed):
python eval.py \
--gold eval/gold_sets/my_docs.jsonl \
--mode auto \
--fts-db data/my_docs_fts.db \
--min-ndcg 0.85 --min-recall 0.80 --max-latency 3000
- Optional: add canary QA queries to config/canaries/ so ingest.assess_quality can enforce pass/fail gates instead of only reporting warnings.
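A hypothetical canary entry is sketched below. The real schema is whatever the placeholder files in config/canaries/ define, so every field name here is an assumption used only to show the shape of "query plus expectation plus threshold".

```python
# Hypothetical canary entry; treat every field name as an assumption and
# check the placeholder files shipped in config/canaries/ for the real schema.
import json
from pathlib import Path

canary = {
    "query": "design hydraulic loading rate for a DAF unit",      # query that must keep passing
    "expect_terms": ["dissolved air flotation", "loading rate"],   # hypothetical expectation fields
    "min_score": 0.5,                                              # hypothetical gate threshold
}

Path("config/canaries").mkdir(parents=True, exist_ok=True)
Path("config/canaries/daf_kb.json").write_text(json.dumps([canary], indent=2))
```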
See for detailed setup instructions.
Search Modes
Semantic Search (mode="semantic")
Pure vector similarity search using dense embeddings.
Best for: Conceptual queries, finding related content even without exact keyword matches
Example: "How do biological systems remove nitrogen?" will find relevant content even if it uses terms like "nitrification" or "denitrification"
Rerank Search (mode="rerank", default)
Vector retrieval followed by cross-encoder reranking.
Best for: Most use cases - good balance of speed and accuracy
Example: Standard searches where you want better precision than pure vector search
Hybrid Search (mode="hybrid")
Combines vector search + BM25 lexical search using RRF, then reranks.
Best for: Complex queries with both conceptual and specific keyword requirements
Example: "stainless steel 316L corrosion in chloride environments" benefits from both semantic understanding and exact term matching
Sparse Search (mode="sparse")
Runs alias-aware BM25 only, useful for short keyword queries or as a fallback when semantic routes miss exact terminology.
Auto Planner (mode="auto")
Default. Heuristics pick among semantic, hybrid, rerank, and sparse routes. When the top score falls below ANSWERABILITY_THRESHOLD, the server abstains and returns telemetry so the MCP client (Claude, etc.) can decide whether to generate a HyDE hypothesis and retry with kb.dense or kb.sparse.
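The abstain-then-HyDE loop described above might look like this from the client side. `call_tool` and the response fields ("abstained", "hits") are assumptions standing in for your MCP client SDK and the server's actual payload shape; the tool names, auto mode, and ANSWERABILITY_THRESHOLD behavior come from the text.

```python
# Auto-mode flow: search, detect an abstain, then retry with a client-written
# HyDE hypothesis. Field names below are assumptions, not the real payload shape.

def answerable_search(call_tool, query: str, collection: str = "my_docs"):
    result = call_tool("kb.search",
                       {"collection": collection, "query": query, "mode": "auto", "top_k": 8})
    if not result.get("abstained"):
        return result["hits"]

    # The server abstained (top score below ANSWERABILITY_THRESHOLD), so the client
    # drafts a hypothetical answer (HyDE) and re-queries the dense route with it.
    hypothesis = (
        "Nitrogen is removed biologically by nitrifying ammonia to nitrate, then "
        "denitrifying nitrate to nitrogen gas under anoxic conditions."
    )
    return call_tool("kb.dense",
                     {"collection": collection, "query": hypothesis, "top_k": 8})["hits"]
```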
Additional MCP Tools
- kb.collections – list configured collection slugs and their backing indices.
- kb.open(collection="slug", chunk_id=...) – fetch a chunk by chunk_id or element_id, optionally slicing by char offsets.
- kb.neighbors(collection="slug", chunk_id=...) – pull FTS neighbors around a chunk for more context.
- kb.summary(collection="slug", topic=...) – retrieve lightweight section summaries (RAPTOR-style) built during ingest.
- kb.graph(node_id=...) – inspect the lightweight graph (doc → section → chunk → entity) generated from structured metadata.
Search Parameters
| Parameter | Default | Description |
|---|---|---|
| query | (required) | Search query text |
| mode | rerank | semantic, rerank, hybrid, sparse, or auto |
| top_k | 8 | Final results returned (1–100) |
| retrieve_k | 24 | Initial candidate pool (1–256) |
| return_k | 8 | Post-rerank results (≤ retrieve_k) |
| n (for kb.neighbors) | 10 (recommended) | Neighbor radius for context expansion; MANDATORY for comprehensive answers |
Parameter Tuning Guide
| Scenario | retrieve_k | return_k | top_k | mode |
|---|---|---|---|---|
| Quick search | 12 | 8 | 5 | rerank |
| Comprehensive | 48 | 16 | 10 | hybrid |
| High precision | 24 | 12 | 5 | hybrid |
| Exploratory | 32 | 12 | 8 | semantic |
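As an illustration, the "Comprehensive" row combined with the recommended neighbor expansion could be wired up as below. `call_tool` and the response field names are assumptions; the parameter values and the kb.neighbors(n=10) guidance are taken from the tables above.

```python
# The "Comprehensive" tuning row followed by neighbor expansion.
# `call_tool` and the response field names are assumptions about the client SDK.

def comprehensive_answer(call_tool, query: str, collection: str = "my_docs"):
    hits = call_tool("kb.hybrid", {
        "collection": collection,
        "query": query,
        "retrieve_k": 48,   # wide candidate pool
        "return_k": 16,     # post-rerank survivors
        "top_k": 10,        # final results handed to the agent
    })["hits"]

    # At max_chars=700 a single chunk rarely holds a complete answer, so expand
    # each hit with its surrounding chunks before synthesizing a response.
    expanded = [
        call_tool("kb.neighbors", {"collection": collection, "chunk_id": h["chunk_id"], "n": 10})
        for h in hits
    ]
    return hits, expanded
```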
Configuration knobs
Set via environment variables (or CLI flags when available):
| Environment variable | Purpose |
|---|---|
| MIX_W_BM25, MIX_W_DENSE, MIX_W_RERANK | Adjust blend between lexical, dense, and rerank signals. |
| HF_HOME | Hugging Face cache directory used by Docling models (default .cache/hf). |
| GRAPH_DB_PATH, SUMMARY_DB_PATH | Override lightweight graph and summary storage locations. |
| ANSWERABILITY_THRESHOLD | Minimum score required for auto mode to respond; lower scores return an abstain for the client to handle (e.g., run HyDE or rephrase). |
Context retrieval (critical): always use kb.neighbors(n=10) after search; single chunks are insufficient at chunk size 700.
Note: Document timeouts have been removed. All documents process to completion regardless of size.
Usage
See for comprehensive documentation including:
- Ingestion parameters and examples
- Multi-collection setup
- Advanced search features (neighbor expansion, time decay)
- Incremental ingestion patterns
- Performance tuning
For upcoming improvements, check .
Further Reading
- MCP agent prompt for Claude Desktop/Code with ingestion and retrieval workflows.
- MCP agent prompt for Codex CLI.
- Deep dive into RRF, reranking, and chunking design choices.
- Planned features including SPLADE sparse expansion and ColBERT late interaction.
Architecture Details
See for a deep dive into:
- Reciprocal Rank Fusion (RRF) algorithm
- Cross-encoder reranking strategy
- Neighbor context expansion
- Embedding model selection rationale
- Chunking strategy
Configuration
Environment Variables
All configuration can be customized via environment variables. See for full documentation.
Key variables:
- OLLAMA_MODEL: Embedding model (default: snowflake-arctic-embed:xs)
- QDRANT_URL: Qdrant server (default: http://localhost:6333)
- TEI_RERANK_URL: Reranker endpoint (default: http://localhost:8087)
- HYBRID_RRF_K: RRF parameter (default: 60)
- NEIGHBOR_CHUNKS: Context expansion (default: 1)
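For illustration only, these defaults could be read roughly as follows; the real variable handling lives in the server's own configuration code, so this is a sketch mirroring the documented values, not the actual implementation.

```python
import os

# Illustrative defaults mirroring the values listed above; the real server
# reads these in its own configuration layer inside server.py.
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "snowflake-arctic-embed:xs")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
TEI_RERANK_URL = os.getenv("TEI_RERANK_URL", "http://localhost:8087")
HYBRID_RRF_K = int(os.getenv("HYBRID_RRF_K", "60"))
NEIGHBOR_CHUNKS = int(os.getenv("NEIGHBOR_CHUNKS", "1"))
```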
Multi-Collection Setup
Configure multiple knowledge bases with independent search tools:
{
"NOMIC_KB_SCOPES": "{
\"technical_docs\": {
\"collection\": \"engineering_kb\",
\"title\": \"Engineering Documentation\"
},
\"legal_docs\": {
\"collection\": \"legal_kb\",
\"title\": \"Legal Research\"
}
}"
}
This creates two MCP tools: search_technical_docs and search_legal_docs.
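The scope-to-tool mapping can be pictured with the short sketch below. The actual tool registration happens inside server.py; this only shows how a scope slug such as technical_docs becomes a search_technical_docs tool bound to its collection.

```python
import json
import os

# Sketch: how the NOMIC_KB_SCOPES JSON maps to one search tool per scope.
# The real registration lives in server.py; this only illustrates the mapping.
scopes = json.loads(os.environ.get("NOMIC_KB_SCOPES", "{}"))
for slug, cfg in scopes.items():
    tool_name = f"search_{slug}"            # e.g. search_technical_docs
    print(tool_name, "->", cfg["collection"], f"({cfg['title']})")
```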
Troubleshooting
Services not starting?
# Check Docker services
docker-compose ps
# Check Ollama
curl http://localhost:11434/api/tags
# Check Qdrant
curl http://localhost:6333/collections
Embeddings failing?
- Ensure the Ollama model is pulled: ollama list
- Check Ollama is running: ollama serve (usually auto-starts)
- Try reducing batch size: add --embed-batch-size 16 to the ingest command
Search returning no results?
- Verify collection name matches ingestion
- Check the Qdrant collection exists: curl http://localhost:6333/collections/{collection_name}
- Confirm the FTS database path is correct
See for more common issues.
Performance
Hardware recommendations:
- Minimum: 4GB RAM, 2 CPU cores
- Recommended: 8GB RAM, 4 CPU cores
- Optimal: 16GB RAM, 8+ CPU cores
Benchmarks (approximate, varies by hardware):
- Ingestion: 5-10 pages/second (with Ollama embeddings)
- Search latency: 100-300ms (hybrid mode with reranking)
- Storage: ~2-3KB per chunk (vector + payload)
Contributing
Contributions welcome! Please see for guidelines.
License
MIT License - see the LICENSE file for details.
Acknowledgments
Built with:
- FastMCP - MCP server framework
- Qdrant - Vector database
- Ollama - Local LLM and embeddings
- Hugging Face TEI - Reranking
- MarkItDown - Document extraction
- Docling - High-fidelity PDF processing
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs in this repository
Why local embeddings and reranking?
- Zero Additional Cost: No per-document embedding fees, no per-query reranking charges - only your Claude subscription
- Unlimited Scale: Ingest and search unlimited documents without incremental costs
- Fast: Local search with <300ms latency - no API roundtrips for embeddings or reranking
- Control: Full customization of embedding models, search parameters, and chunking strategy