
🧠 Conceptual-Search RAG Server

Node.js 18+ · License: MIT · MCP Compatible · TypeScript · Python 3.9+

A RAG MCP server that enables LLMs to interact with local PDF/EPUB documents through conceptual search. Combines corpus-driven concept extraction, WordNet semantic enrichment, and multi-signal hybrid ranking powered by LanceDB for superior retrieval accuracy.




📝 Available Tools

The server provides 11 specialized MCP tools organized into four categories:

Document Discovery

| Tool | Description | Example Query |
|------|-------------|---------------|
| `catalog_search` | Semantic search for documents by topic, title, or author | "software architecture patterns" |
| `category_search` | Browse documents by category/domain | "software engineering" |
| `list_categories` | List all categories in your library | (no query required) |

Content Search

| Tool | Description | Example Query |
|------|-------------|---------------|
| `broad_chunks_search` | Cross-document search (phrases, keywords, topics) | "implementing dependency injection" |
| `chunks_search` | Search within a specific known document | "SOLID principles" + source path |

Concept Analysis

| Tool | Description | Example Query |
|------|-------------|---------------|
| `concept_search` | Find chunks by concept with fuzzy matching | "design patterns" |
| `extract_concepts` | Export all concepts from a document | "Clean Architecture" |
| `source_concepts` | Find documents where concept(s) appear (union) | ["TDD", "BDD"] → all docs with either |
| `concept_sources` | Get per-concept source lists (separate arrays) | ["TDD", "BDD"] → sources for each |
| `list_concepts_in_category` | Find concepts in a category | "distributed systems" |
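The difference between `source_concepts` (union) and `concept_sources` (per-concept lists) is easy to miss. A minimal Python sketch over a hypothetical concept-to-documents index illustrates the two result shapes (the data and function bodies below are illustrative, not the server's actual implementation):

```python
# Hypothetical concept -> documents index (illustrative data only).
index = {
    "TDD": ["refactoring.pdf", "tdd-by-example.pdf"],
    "BDD": ["specification-by-example.pdf", "tdd-by-example.pdf"],
}

def source_concepts(concepts):
    """Union semantics: one merged list of documents where ANY concept appears."""
    docs = set()
    for concept in concepts:
        docs.update(index.get(concept, []))
    return sorted(docs)

def concept_sources(concepts):
    """Per-concept semantics: a separate source list for each concept."""
    return {c: sorted(index.get(c, [])) for c in concepts}

print(source_concepts(["TDD", "BDD"]))   # one merged, deduplicated list
print(concept_sources(["TDD", "BDD"]))   # {"TDD": [...], "BDD": [...]}
```

Use the union form when you want a reading list covering any of several topics, and the per-concept form when you need to know which documents back each concept individually.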

Agent Guidance

| Tool | Description | Example Query |
|------|-------------|---------------|
| `get_guidance` | Get research rules and tool selection guidance | topic: "rules" or "tool-selection" |

📖 Full API documentation: See for complete JSON I/O schemas.

For AI agents: See for the decision tree and selection guidance.

Agent Integration

To improve AI agent performance when using these tools, integrate the agent rules into your project:

  • Essential rules for efficient tool use and answer synthesis
  • Decision tree for selecting the right tool

Recommended: Include agent-quick-rules.md in your agent's system prompt or context to ensure:

  • Proper tool selection workflow (catalog_search → chunks_search)
  • Efficient stopping criteria (4-6 tool calls maximum)
  • Synthesized answers instead of search narration

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • Python 3.9+ with NLTK
  • OpenRouter API key (sign up here)
  • MCP Client (Cursor or Claude Desktop)

Installation

# Clone and build
git clone https://github.com/m2ux/concept-rag.git
cd concept-rag
npm install
npm run build

# Install WordNet
pip3 install nltk
python3 -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"

# Configure API key
cp .env.example .env
# Edit .env and add your OpenRouter API key

Seed Your Documents

# Set environment
source .env

# Initial seeding (create database)
npx tsx hybrid_fast_seed.ts \
  --dbpath ~/.concept_rag \
  --filesdir ~/Documents/my-pdfs \
  --overwrite

# Incremental seeding (add new documents only - much faster!)
npx tsx hybrid_fast_seed.ts \
  --dbpath ~/.concept_rag \
  --filesdir ~/Documents/my-pdfs

Seeding Options:

| Flag | Description |
|------|-------------|
| `--filesdir` | Directory containing PDF/EPUB files (required) |
| `--dbpath` | Database path (default: `~/.concept_rag`) |
| `--overwrite` | Drop and recreate all database tables |
| `--parallel N` | Process N documents concurrently (default: 10, max: 25) |
| `--resume` | Skip documents already in checkpoint (for interrupted runs) |
| `--clean-checkpoint` | Clear checkpoint file and start fresh |
| `--rebuild-concepts` | Rebuild concept index even if no new documents |
| `--auto-reseed` | Re-process documents with incomplete metadata |
| `--max-docs N` | Process at most N new documents (for batching) |
| `--with-wordnet` | Enable WordNet enrichment (disabled by default) |

📝 Automatic Logging: Each run creates a timestamped log file in logs/seed-YYYY-MM-DDTHH-MM-SS.log for troubleshooting and audit trails.

📊 Progress Tracking: Real-time progress bars show document processing, concept extraction, and index building stages.
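The `--resume` behavior can be pictured as a simple checkpoint filter: record each finished document, then skip recorded ones on the next run. This is a minimal sketch under assumed names; the file name, JSON layout, and function names here are hypothetical, and the real logic lives in `src/infrastructure/checkpoint/`:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical path and format

def load_done(path=CHECKPOINT):
    """Return the set of document hashes already processed, if any."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f)["done"])

def pending(all_hashes, done):
    """Documents still to process after an interrupted run."""
    return [h for h in all_hashes if h not in done]

def mark_done(doc_hash, done, path=CHECKPOINT):
    """Record a processed document so a later --resume run skips it."""
    done.add(doc_hash)
    with open(path, "w") as f:
        json.dump({"done": sorted(done)}, f)
```

For example, if `"3cde"` was completed before a crash, a resumed run over `["3cde", "7f2b"]` would only process `"7f2b"`.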

To seed specific documents

# Seed specific documents by hash prefix (shown in seeding output)
npx tsx scripts/seed_specific.ts --hash 3cde 7f2b

# Or by filename pattern
npx tsx scripts/seed_specific.ts --pattern "Transaction Processing"

Maintenance Scripts

# Health check - verify database integrity
npx tsx scripts/health-check.ts

# Rebuild derived name fields (after schema changes)
npx tsx scripts/rebuild_derived_names.ts --dbpath ~/.concept_rag

# Link related concepts (lexical similarity)
npx tsx scripts/link_related_concepts.ts --dbpath ~/.concept_rag

# Analyze backup differences
npx tsx scripts/analyze-backups.ts backup1/ backup2/

See for all maintenance utilities.

Configure MCP Client

Cursor (~/.cursor/mcp.json):

{
  "mcpServers": {
    "concept-rag": {
      "command": "node",
      "args": [
        "/path/to/concept-rag/dist/conceptual_index.js",
        "/home/username/.concept_rag"
      ]
    }
  }
}

Restart your MCP client and start searching!

📖 For complete setup instructions for various IDEs, see

🛠️ Development

Project Structure

src/
├── conceptual_index.ts           # MCP server entry point
├── application/                  # Composition root (DI)
├── domain/                       # Domain models, services, interfaces
│   ├── models/                   # Chunk, Concept, SearchResult
│   ├── services/                 # Domain services (search logic)
│   └── interfaces/               # Repository and service interfaces
├── infrastructure/               # External integrations
│   ├── lancedb/                  # Database adapters (normalized schema v7)
│   ├── embeddings/               # Embedding service
│   ├── search/                   # Hybrid search with 4-signal scoring
│   ├── resilience/               # Circuit breaker, bulkhead, timeout patterns
│   ├── checkpoint/               # Resumable seeding with progress tracking
│   ├── cli/                      # Progress bar display utilities
│   └── document-loaders/         # PDF, EPUB loaders with OCR fallback
├── concepts/                     # Concept extraction & indexing
│   ├── concept_extractor.ts      # LLM-based extraction
│   ├── parallel-concept-extractor.ts  # Concurrent document processing
│   ├── concept_index.ts          # Index builder with lexical linking
│   ├── query_expander.ts         # Query expansion with WordNet
│   └── summary_generator.ts      # LLM summary generation
├── wordnet/                      # WordNet integration
└── tools/                        # MCP tools (11 operations)

scripts/
├── health-check.ts               # Database integrity verification
├── rebuild_derived_names.ts      # Regenerate derived text fields
├── link_related_concepts.ts      # Build concept relationship graph
├── seed_specific.ts              # Targeted document re-seeding
└── analyze-backups.ts            # Backup comparison and analysis

🏗️ Architecture

     PDF/EPUB Documents
             ↓
   Processing + OCR fallback
             ↓
   ┌────────┬──────────┬──────────┐
   ↓        ↓          ↓          ↓
Catalog   Chunks    Concepts  Categories
(docs)    (text)    (index)   (taxonomy)
   └────────┴──────────┴──────────┘
             ↓
      Hybrid Search Engine
 (Vector + BM25 + Concepts + WordNet)
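Multi-signal fusion of this kind is often a weighted sum over normalized per-signal scores. The sketch below shows that shape only; the weights, signal names, and formula are illustrative assumptions, not the server's actual scoring code:

```python
# Illustrative 4-signal hybrid scoring: weighted sum of normalized signals.
WEIGHTS = {"vector": 0.4, "bm25": 0.3, "concept": 0.2, "wordnet": 0.1}

def hybrid_score(signals):
    """Combine per-signal scores (each normalized to [0, 1]) into one rank score."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def rank(chunks):
    """Sort candidate chunks by descending hybrid score."""
    return sorted(chunks, key=lambda c: hybrid_score(c["signals"]), reverse=True)

candidates = [
    {"id": "a", "signals": {"vector": 0.9, "bm25": 0.1, "concept": 0.0, "wordnet": 0.0}},
    {"id": "b", "signals": {"vector": 0.5, "bm25": 0.8, "concept": 0.7, "wordnet": 0.4}},
]
```

Here chunk `b` outranks `a` (0.62 vs. 0.39) even though `a` has the stronger vector match: agreement across several signals beats a single strong one, which is the point of hybrid ranking.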

Four-Table Normalized Schema:

  • Catalog: Document metadata with derived concept_names, category_names
  • Chunks: Text segments with catalog_title, concept_names
  • Concepts: Deduplicated index with lexical/adjacent relationships
  • Categories: Hierarchical taxonomy with statistics
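The "derived fields" idea is plain denormalization: resolve concept IDs to a `concept_names` field once at seeding time, so search results carry readable names without per-query ID lookups. A minimal sketch with hypothetical field names (the actual schema may differ):

```python
# Normalized source table (illustrative): concepts keyed by id.
concepts = {1: "dependency injection", 2: "hexagonal architecture"}

def derive_concept_names(row, concepts):
    """Denormalize once at seeding time: copy the row and attach a derived
    concept_names field resolved from its concept_ids."""
    derived = dict(row)
    derived["concept_names"] = [concepts[i] for i in row["concept_ids"]]
    return derived

catalog_row = {"title": "Clean Architecture", "concept_ids": [1, 2]}
seeded = derive_concept_names(catalog_row, concepts)
```

The trade-off is standard: derived fields must be rebuilt when concepts change (see `rebuild_derived_names.ts` above), in exchange for lookup-free reads at query time.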

See for complete schema documentation.

🎨 Design

This project follows well-documented architectural principles and design decisions:

Architecture Decision Records (ADRs)

All major technical decisions are documented in .

Documentation

  • - Complete MCP tool documentation with JSON I/O schemas
  • - Decision tree and usage guidance for AI agents
  • - Four-table normalized schema with derived fields
  • - Comprehensive test documentation with links to all E2E, integration, unit, and property tests

💬 Support & Community

🙏 Acknowledgments

This project is forked from lance-mcp by adiom-data. The original project provided the foundational MCP server architecture and LanceDB integration.

This fork extends the original with:

  • 🔄 Recursive self-improvement - Used its own tools to discover and apply design patterns
  • 📚 Formal concept model - Rigorous definition ensuring semantic matching and disambiguation
  • 🧠 Enhanced concept extraction - 80-150+ concepts per document (Claude Sonnet 4.5)
  • 🌐 WordNet semantic enrichment - Synonym expansion and hierarchical navigation
  • 🔍 Multi-signal hybrid ranking - 4-signal scoring (vector + BM25 + concept + WordNet), with title matching as a boost
  • 📖 Large document support - Multi-pass extraction for >100k token documents
  • ⚡ Parallel concept extraction - Process up to 25 documents concurrently with shared rate limiting
  • 🔁 Resumable seeding - Checkpoint-based recovery from interrupted runs
  • 🛡️ System resilience - Circuit breaker, bulkhead, and timeout patterns for external services
  • 📊 Normalized schema (v7) - Derived text fields eliminate ID cache lookups at runtime
  • 🔗 Concept relationships - Adjacent (co-occurrence) and related (lexical) concept linking
  • 🏥 Health checks - Database integrity verification with detailed reporting
  • 🏗️ Clean Architecture - Domain-Driven Design patterns throughout (see )

We're grateful to the original author for creating and open-sourcing this excellent foundation!

Contributing

We welcome contributions! See for:

  • Development setup
  • Code style guidelines
  • Pull request process
  • Areas needing help

📜 License

This project is licensed under the MIT License - see the file for details.