🧠 Conceptual-Search RAG Server
A RAG MCP server that enables LLMs to interact with local PDF/EPUB documents through conceptual search. Combines corpus-driven concept extraction, WordNet semantic enrichment, and multi-signal hybrid ranking powered by LanceDB for superior retrieval accuracy.
📝 Available Tools
The server provides 11 specialized MCP tools organized into four categories:
Document Discovery
| Tool | Description | Example Query |
|---|---|---|
| `catalog_search` | Semantic search over documents by topic, title, or author | "software architecture patterns" |
| `category_search` | Browse documents by category/domain | "software engineering" |
| `list_categories` | List all categories in your library | (no query required) |
Content Search
| Tool | Description | Example Query |
|---|---|---|
| `broad_chunks_search` | Cross-document search (phrases, keywords, topics) | "implementing dependency injection" |
| `chunks_search` | Search within a specific known document | "SOLID principles" + source path |
Concept Analysis
| Tool | Description | Example Query |
|---|---|---|
| `concept_search` | Find chunks by concept with fuzzy matching | "design patterns" |
| `extract_concepts` | Export all concepts from a document | "Clean Architecture" |
| `source_concepts` | Find documents where concept(s) appear (union) | ["TDD", "BDD"] → all docs with either |
| `concept_sources` | Get per-concept source lists (separate arrays) | ["TDD", "BDD"] → sources for each |
| `list_concepts_in_category` | Find concepts in a category | "distributed systems" |
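Tool calls follow the standard MCP `tools/call` request shape. A minimal sketch for `concept_search` — the argument name (`concept`) is an assumption, so check the full API documentation for the server's actual JSON schema:

```json
{
  "method": "tools/call",
  "params": {
    "name": "concept_search",
    "arguments": { "concept": "design patterns" }
  }
}
```

The result payload (matching chunks with scores) is server-specific and documented in the API reference.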
Agent Guidance
| Tool | Description | Example Query |
|---|---|---|
| `get_guidance` | Get research rules and tool selection guidance | topic: "rules" or "tool-selection" |
📖 Full API documentation: see the API reference for complete JSON I/O schemas.
For AI agents: see the agent guide for the decision tree and selection guidance.
Agent Integration
To improve AI agent performance when using these tools, integrate the agent rules into your project:
| Resource | Purpose |
|---|---|
| `agent-quick-rules.md` | Essential rules for efficient tool use and answer synthesis |
| Tool selection guide | Decision tree for selecting the right tool |
Recommended: Include `agent-quick-rules.md` in your agent's system prompt or context to ensure:
- Proper tool selection workflow (`catalog_search` → `chunks_search`)
- Efficient stopping criteria (4-6 tool calls maximum)
- Synthesized answers instead of search narration
🚀 Quick Start
Prerequisites
- Node.js 18+
- Python 3.9+ with NLTK
- OpenRouter API key (sign up at openrouter.ai)
- MCP Client (Cursor or Claude Desktop)
Installation
```bash
# Clone and build
git clone https://github.com/m2ux/concept-rag.git
cd concept-rag
npm install
npm run build

# Install WordNet
pip3 install nltk
python3 -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"

# Configure API key
cp .env.example .env
# Edit .env and add your OpenRouter API key
```
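A minimal `.env` might look like the following; the variable name is an assumption based on common OpenRouter conventions, so confirm the exact key names against the project's `.env.example`:

```bash
# Hypothetical .env contents -- verify key names against .env.example
OPENROUTER_API_KEY=sk-or-your-key-here
```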
Seed Your Documents
```bash
# Set environment
source .env

# Initial seeding (creates the database)
npx tsx hybrid_fast_seed.ts \
  --dbpath ~/.concept_rag \
  --filesdir ~/Documents/my-pdfs \
  --overwrite

# Incremental seeding (adds new documents only - much faster!)
npx tsx hybrid_fast_seed.ts \
  --dbpath ~/.concept_rag \
  --filesdir ~/Documents/my-pdfs
```
Seeding Options:
| Flag | Description |
|---|---|
| `--filesdir` | Directory containing PDF/EPUB files (required) |
| `--dbpath` | Database path (default: `~/.concept_rag`) |
| `--overwrite` | Drop and recreate all database tables |
| `--parallel N` | Process N documents concurrently (default: 10, max: 25) |
| `--resume` | Skip documents already in checkpoint (for interrupted runs) |
| `--clean-checkpoint` | Clear checkpoint file and start fresh |
| `--rebuild-concepts` | Rebuild concept index even if no new documents |
| `--auto-reseed` | Re-process documents with incomplete metadata |
| `--max-docs N` | Process at most N new documents (for batching) |
| `--with-wordnet` | Enable WordNet enrichment (disabled by default) |
📝 Automatic Logging: Each run creates a timestamped log file in logs/seed-YYYY-MM-DDTHH-MM-SS.log for troubleshooting and audit trails.
📊 Progress Tracking: Real-time progress bars show document processing, concept extraction, and index building stages.
To seed specific documents
```bash
# Seed specific documents by hash prefix (shown in seeding output)
npx tsx scripts/seed_specific.ts --hash 3cde 7f2b

# Or by filename pattern
npx tsx scripts/seed_specific.ts --pattern "Transaction Processing"
```
Maintenance Scripts
```bash
# Health check - verify database integrity
npx tsx scripts/health-check.ts

# Rebuild derived name fields (after schema changes)
npx tsx scripts/rebuild_derived_names.ts --dbpath ~/.concept_rag

# Link related concepts (lexical similarity)
npx tsx scripts/link_related_concepts.ts --dbpath ~/.concept_rag

# Analyze backup differences
npx tsx scripts/analyze-backups.ts backup1/ backup2/
```
See the `scripts/` directory for all maintenance utilities.
Configure MCP Client
Cursor (`~/.cursor/mcp.json`):

```json
{
  "mcpServers": {
    "concept-rag": {
      "command": "node",
      "args": [
        "/path/to/concept-rag/dist/conceptual_index.js",
        "/home/username/.concept_rag"
      ]
    }
  }
}
```
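Claude Desktop takes the same server entry. A sketch assuming the default macOS config location, `~/Library/Application Support/Claude/claude_desktop_config.json` (adjust paths for your platform and install location):

```json
{
  "mcpServers": {
    "concept-rag": {
      "command": "node",
      "args": [
        "/path/to/concept-rag/dist/conceptual_index.js",
        "/home/username/.concept_rag"
      ]
    }
  }
}
```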
Restart your MCP client and start searching!
📖 For complete setup instructions for other IDEs, see the setup guide.
🛠️ Development
Project Structure
```
src/
├── conceptual_index.ts               # MCP server entry point
├── application/                      # Composition root (DI)
├── domain/                           # Domain models, services, interfaces
│   ├── models/                       # Chunk, Concept, SearchResult
│   ├── services/                     # Domain services (search logic)
│   └── interfaces/                   # Repository and service interfaces
├── infrastructure/                   # External integrations
│   ├── lancedb/                      # Database adapters (normalized schema v7)
│   ├── embeddings/                   # Embedding service
│   ├── search/                       # Hybrid search with 4-signal scoring
│   ├── resilience/                   # Circuit breaker, bulkhead, timeout patterns
│   ├── checkpoint/                   # Resumable seeding with progress tracking
│   ├── cli/                          # Progress bar display utilities
│   └── document-loaders/             # PDF, EPUB loaders with OCR fallback
├── concepts/                         # Concept extraction & indexing
│   ├── concept_extractor.ts          # LLM-based extraction
│   ├── parallel-concept-extractor.ts # Concurrent document processing
│   ├── concept_index.ts              # Index builder with lexical linking
│   ├── query_expander.ts             # Query expansion with WordNet
│   └── summary_generator.ts          # LLM summary generation
├── wordnet/                          # WordNet integration
└── tools/                            # MCP tools (10 operations)

scripts/
├── health-check.ts                   # Database integrity verification
├── rebuild_derived_names.ts          # Regenerate derived text fields
├── link_related_concepts.ts          # Build concept relationship graph
├── seed_specific.ts                  # Targeted document re-seeding
└── analyze-backups.ts                # Backup comparison and analysis
```
🏗️ Architecture
```
        PDF/EPUB Documents
                ↓
      Processing + OCR fallback
                ↓
     ┌─────────┬─────────┬─────────┐
     ↓         ↓         ↓         ↓
  Catalog   Chunks   Concepts  Categories
  (docs)    (text)   (index)   (taxonomy)
     └─────────┴─────────┴─────────┘
                ↓
        Hybrid Search Engine
 (Vector + BM25 + Concepts + WordNet)
```
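To make the 4-signal scoring concrete, here is a minimal sketch of how per-signal scores could be combined into one ranking score. The weights and field names are illustrative assumptions, not the server's actual tuning:

```typescript
// Illustrative 4-signal hybrid scorer. The weights are hypothetical;
// each signal is assumed to be pre-normalized to the range [0, 1].
interface SignalScores {
  vector: number;   // cosine similarity from embedding search
  bm25: number;     // normalized BM25 keyword score
  concept: number;  // overlap with extracted concepts
  wordnet: number;  // match via WordNet synonym expansion
}

const WEIGHTS: SignalScores = { vector: 0.4, bm25: 0.3, concept: 0.2, wordnet: 0.1 };

function hybridScore(s: SignalScores): number {
  return (
    WEIGHTS.vector * s.vector +
    WEIGHTS.bm25 * s.bm25 +
    WEIGHTS.concept * s.concept +
    WEIGHTS.wordnet * s.wordnet
  );
}

// Rank candidate chunks by combined score, highest first.
function rank<T extends { signals: SignalScores }>(chunks: T[]): T[] {
  return [...chunks].sort((a, b) => hybridScore(b.signals) - hybridScore(a.signals));
}
```

Combining normalized signals with a weighted sum is one common design; the real engine may use different normalization or fusion (e.g. reciprocal rank fusion).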
Four-Table Normalized Schema:
- Catalog: Document metadata with derived `concept_names`, `category_names`
- Chunks: Text segments with `catalog_title`, `concept_names`
- Concepts: Deduplicated index with lexical/adjacent relationships
- Categories: Hierarchical taxonomy with statistics
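The derived fields can be pictured as denormalized columns carried on each row, so reads need no ID-cache lookups. A sketch using only field names mentioned in this README (`text` and `title` are hypothetical; the real v7 schema has more columns):

```typescript
// Sketch of two of the four tables; field names other than the derived
// ones (catalog_title, concept_names, category_names) are assumptions.
interface CatalogRow {
  title: string;            // hypothetical field name
  concept_names: string[];  // derived from the Concepts table
  category_names: string[]; // derived from the Categories table
}

interface ChunkRow {
  text: string;             // hypothetical field name
  catalog_title: string;    // derived from Catalog, avoids runtime ID lookups
  concept_names: string[];
}

const chunk: ChunkRow = {
  text: "Dependency injection decouples construction from use.",
  catalog_title: "Clean Architecture",
  concept_names: ["dependency injection", "design patterns"],
};
```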
See the schema documentation for the complete table definitions.
🎨 Design
This project follows well-documented architectural principles and design decisions:
Architecture Decision Records (ADRs)
All major technical decisions are documented as ADRs in the repository.
Documentation
- Complete MCP tool documentation with JSON I/O schemas
- Decision tree and usage guidance for AI agents
- Four-table normalized schema with derived fields
- Comprehensive test documentation with links to all E2E, integration, unit, and property tests
💬 Support & Community
🙏 Acknowledgments
This project is forked from lance-mcp by adiom-data. The original project provided the foundational MCP server architecture and LanceDB integration.
This fork extends the original with:
- 🔄 Recursive self-improvement - Used its own tools to discover and apply design patterns
- 📚 Formal concept model - Rigorous definition ensuring semantic matching and disambiguation
- 🧠 Enhanced concept extraction - 80-150+ concepts per document (Claude Sonnet 4.5)
- 🌐 WordNet semantic enrichment - Synonym expansion and hierarchical navigation
- 🔍 Multi-signal hybrid ranking - Vector + BM25 + title + concept + WordNet (4-signal scoring)
- 📖 Large document support - Multi-pass extraction for >100k token documents
- ⚡ Parallel concept extraction - Process up to 25 documents concurrently with shared rate limiting
- 🔁 Resumable seeding - Checkpoint-based recovery from interrupted runs
- 🛡️ System resilience - Circuit breaker, bulkhead, and timeout patterns for external services
- 📊 Normalized schema (v7) - Derived text fields eliminate ID cache lookups at runtime
- 🔗 Concept relationships - Adjacent (co-occurrence) and related (lexical) concept linking
- 🏥 Health checks - Database integrity verification with detailed reporting
- 🏗️ Clean Architecture - Domain-Driven Design patterns throughout
We're grateful to the original author for creating and open-sourcing this excellent foundation!
Contributing
We welcome contributions! See the contributing guide for:
- Development setup
- Code style guidelines
- Pull request process
- Areas needing help
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.