docrag by ryan-m-bishop - MCP Server

DocRAG - AI Documentation RAG System

A lightweight, pip-installable Python package that provides RAG (Retrieval-Augmented Generation) access to technical documentation through an MCP (Model Context Protocol) server. This lets LLMs search and retrieve relevant documentation on demand.

Features

🚀 Single pip-installable package with CLI and MCP server
📚 Project-based documentation collections (BrightSign, Venafi, Qumu, web frameworks)
🔍 Local vector database (LanceDB) with embeddings generated locally
📥 Easy documentation ingestion from local files or scraped sources
🤖 Designed for use with Claude Code via MCP

Installation

Prerequisites

Python 3.10+
pipx (recommended) or pip
git (for updates)

Recommended: Install globally with pipx

# Install globally with pipx in editable mode (keeps dependencies isolated)
pipx install -e /opt/claude-ops/doc-rag

# Verify installation
docrag --help

# Optional: Install Playwright browsers (for scraping)
pipx runpip docrag install playwright
pipx run --spec docrag playwright install chromium

Note: The -e flag installs in "editable" mode, which means changes to the source code are immediately reflected without reinstalling.

Alternative: Install from source (development)

# Clone or navigate to the project directory
cd /opt/claude-ops/doc-rag

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers (for scraping)
playwright install chromium

Updating DocRAG

Option 1: Using the Update Script (Recommended)

cd /opt/claude-ops/doc-rag
./update.sh

This script will:

Pull latest changes from git
Detect your installation method (pipx or pip)
Reinstall only if necessary (non-editable installs)
Handle editable installs automatically

Option 2: Using Make

cd /opt/claude-ops/doc-rag
make update

Option 3: Manual Update

For editable installs (installed with -e):

cd /opt/claude-ops/doc-rag
git pull origin main
# No reinstall needed - changes are already active!

For regular installs (installed without -e):

cd /opt/claude-ops/doc-rag
git pull origin main
pipx uninstall docrag && pipx install .
# or for pip: pip install . --force-reinstall

Verifying Updates

# Check git status
cd /opt/claude-ops/doc-rag
git log -1 --oneline

# Test the installation
docrag --version
docrag --help

Quick Start

1. Initialize DocRAG

docrag init

This creates the configuration directory at ~/.docrag/ with the following structure:

~/.docrag/
├── config.json           # Global configuration
├── collections/          # Documentation collections
└── vectordb/             # LanceDB storage
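
The layout above can be reproduced with a few lines of Python. This is a minimal sketch, not the actual `docrag init` implementation: the function name `init_layout` and the empty `active_collections` default are assumptions, and the remaining default values are copied from the config.json example in the Configuration section.

```python
import json
import tempfile
from pathlib import Path

# Assumed defaults; the values mirror the config.json example in this README.
DEFAULT_CONFIG = {
    "active_collections": [],
    "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
    "chunk_size": 512,
    "chunk_overlap": 50,
}


def init_layout(root: Path) -> Path:
    """Create the directory tree and write config.json if it does not exist yet."""
    (root / "collections").mkdir(parents=True, exist_ok=True)
    (root / "vectordb").mkdir(parents=True, exist_ok=True)
    config_path = root / "config.json"
    if not config_path.exists():
        # Never overwrite an existing configuration on a repeated init.
        config_path.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
    return config_path


if __name__ == "__main__":
    # Demonstrate in a temporary directory instead of touching ~/.docrag.
    with tempfile.TemporaryDirectory() as tmp:
        config = init_layout(Path(tmp) / ".docrag")
        print(sorted(p.name for p in config.parent.iterdir()))
```

Because the directories are created with `exist_ok=True` and the config file is only written when missing, running the sketch twice leaves an existing setup untouched.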

2. Add a Documentation Collection

# Add documentation from a local directory
docrag add brightsign --source /path/to/brightsign/docs --description "BrightSign player documentation"

# Or add without source initially
docrag add venafi --description "Venafi TPP API documentation"

3. List Collections

docrag list

4. Search Documentation (CLI Testing)

# Search across all active collections
docrag search "how to initialize the player"

# Search a specific collection
docrag search "authentication methods" --collection venafi --limit 10

5. Start the MCP Server

docrag serve

The server communicates over stdio (standard input and output) and waits for an MCP client such as Claude Code to connect.

CLI Commands

`docrag init`

Initialize DocRAG configuration directory.

`docrag add <name>`

Add a new documentation collection.

Options:

-s, --source PATH - Source directory containing documentation
-d, --description TEXT - Description of the collection

Example:

docrag add qumu --source ~/docs/qumu --description "Qumu video platform docs"

`docrag list`

List all documentation collections with their status.

`docrag update <name> <source>`

Update an existing collection with new documents.

Example:

docrag update brightsign ~/docs/brightsign/updated

`docrag remove <name>`

Remove a documentation collection (with confirmation).

`docrag search <query>`

Search documentation from the CLI for testing.

Options:

-c, --collection TEXT - Specific collection to search
-l, --limit INTEGER - Number of results (default: 5)

Example:

docrag search "websocket connection" --collection brightsign

`docrag serve`

Start the MCP server for Claude Code integration.

`docrag scrape <url>`

Scrape documentation from websites.

Options:

-o, --output PATH - Output directory (required)
--smart, --use-crawl4ai - Use AI-powered Crawl4AI scraper (recommended)
--no-llm - Disable LLM extraction (faster; still more accurate than the basic scraper)
--llm-provider TEXT - LLM provider (default: openai/gpt-4o-mini)
--playwright - Use Playwright for dynamic content (basic scraper)
--max-pages INTEGER - Maximum pages to scrape (default: 1000)

Examples:

# Basic scraping
docrag scrape https://docs.example.com --output ./docs

# Smart scraping with AI (recommended)
docrag scrape https://docs.example.com --output ./docs --smart

# Smart scraping without LLM (faster, no API key needed)
docrag scrape https://docs.example.com --output ./docs --smart --no-llm

# Limit pages
docrag scrape https://docs.example.com --output ./docs --max-pages 100

Smart Scraping Features:

✨ AI-powered content extraction
🎯 Automatically removes navigation and boilerplate
📊 Better handling of complex layouts
🧠 Semantic understanding of documentation structure
⚡ Faster and more accurate than basic scraping

To enable smart scraping:

# Install Crawl4AI
pipx inject docrag crawl4ai

# Optional: Set OpenAI API key for LLM-powered extraction
export OPENAI_API_KEY='your-key-here'

Using with Claude Code

1. Configure Claude Code MCP Settings

Add DocRAG to your Claude Code MCP configuration (~/.config/claude-code/mcp_settings.json or similar):

{
  "mcpServers": {
    "docrag": {
      "command": "docrag",
      "args": ["serve"],
      "env": {}
    }
  }
}

If using the full path:

{
  "mcpServers": {
    "docrag": {
      "command": "/home/claude-admin/.local/bin/docrag",
      "args": ["serve"],
      "env": {}
    }
  }
}
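
If the settings file already lists other MCP servers, editing it by hand risks clobbering them. The following is a hedged sketch of merging the entry programmatically; `register_docrag` is a hypothetical helper, not part of DocRAG, and the settings path must be whatever your Claude Code installation actually uses.

```python
import json
import shutil
import tempfile
from pathlib import Path


def register_docrag(settings_path: Path, command: str = "docrag") -> dict:
    """Add or replace the "docrag" server entry, leaving other servers untouched."""
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    servers = settings.setdefault("mcpServers", {})
    servers["docrag"] = {"command": command, "args": ["serve"], "env": {}}
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2))
    return settings


if __name__ == "__main__":
    # Use the full path when docrag is on PATH, else fall back to the bare name.
    command = shutil.which("docrag") or "docrag"
    with tempfile.TemporaryDirectory() as tmp:
        result = register_docrag(Path(tmp) / "mcp_settings.json", command)
        print(json.dumps(result, indent=2))
```

Resolving the command with `shutil.which` covers the full-path variant shown above without hard-coding a home directory.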

2. Restart Claude Code

After adding the configuration, restart Claude Code to load the MCP server.

3. Use in Claude Code

Once connected, Claude Code can use two tools:

search_docs: Search through indexed documentation collections

Query: "how to handle authentication in BrightSign"
Collection: (optional) "brightsign"
Limit: (optional) 5

list_collections: List all available documentation collections

Claude will automatically use these tools when working on projects that need documentation access.
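
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below builds such requests for the two tools; the argument keys `query`, `collection`, and `limit` are assumed from the parameter list above and may differ from the server's actual input schema.

```python
import json


def tool_call_request(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


# Argument keys are assumptions based on the parameters listed in this README.
search = tool_call_request(
    1,
    "search_docs",
    {
        "query": "how to handle authentication in BrightSign",
        "collection": "brightsign",
        "limit": 5,
    },
)
listing = tool_call_request(2, "list_collections", {})

print(search)
print(listing)
```

Claude Code constructs these messages itself; the sketch is only useful for understanding or debugging what travels over stdio.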

Architecture

Core Components

ConfigManager (config.py) - Manages configuration and collection metadata
EmbeddingGenerator (embeddings.py) - Generates embeddings using sentence-transformers
VectorDB (vectordb.py) - LanceDB wrapper for vector storage and search
DocumentIndexer (indexer.py) - Intelligent document chunking and indexing
DocRAGServer (server.py) - MCP server implementation
CLI (cli.py) - Command-line interface

Technical Stack

MCP Framework: Official Anthropic MCP package
Vector Database: LanceDB (lightweight, file-based, performant)
Embeddings: sentence-transformers with all-MiniLM-L6-v2 model (384 dims, fast, local)
Text Processing: langchain-text-splitters for intelligent chunking
CLI: Click for user-friendly commands
Web Scraping: Playwright + BeautifulSoup4 (basic scraper), with optional Crawl4AI for smart scraping
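
To illustrate what a vector search over embedded chunks does, here is a dependency-free sketch that ranks chunks by cosine similarity. It is a stand-in for illustration only: DocRAG delegates search to LanceDB, the distance metric it uses is not stated in this README, and the three-dimensional vectors below are toy data in place of real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 5):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows: (chunk text, embedding). Real embeddings come from the model.
rows = [
    ("player initialization", [1.0, 0.0, 0.0]),
    ("certificate renewal", [0.0, 1.0, 0.0]),
    ("player networking", [1.0, 1.0, 0.0]),
]

for score, text in top_k([1.0, 0.0, 0.0], rows, k=2):
    print(f"{score:.3f}  {text}")
```

A real vector database avoids this linear scan with an index, which is why the brute-force version only makes sense for small collections or for explanation.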

Data Structure

~/.docrag/
├── config.json                 # Global configuration
│   └── {
│         "active_collections": ["brightsign", "venafi"],
│         "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
│         "chunk_size": 512,
│         "chunk_overlap": 50
│       }
├── collections/
│   ├── brightsign/
│   │   ├── metadata.json       # Collection metadata
│   │   └── source_docs/        # Original documents
│   ├── venafi/
│   └── qumu/
└── vectordb/
    └── lancedb/                # Vector storage (one table per collection)

Configuration

Global configuration is stored in ~/.docrag/config.json:

{
  "active_collections": ["brightsign", "venafi"],
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "chunk_size": 512,
  "chunk_overlap": 50
}
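
The `chunk_size` and `chunk_overlap` settings control how documents are split before embedding. The sketch below shows the basic sliding-window idea, assuming both values are measured in characters; DocRAG itself uses langchain-text-splitters, which also respects separators such as paragraph breaks, so real chunk boundaries will differ.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text.
            break
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1000))
    pieces = chunk_text(sample)
    print([len(piece) for piece in pieces])
    # The tail of each chunk repeats at the head of the next one.
    print(pieces[0][-50:] == pieces[1][:50])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.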

Collection metadata is stored in ~/.docrag/collections/<name>/metadata.json:

{
  "name": "brightsign",
  "source_type": "local",
  "source_path": "/path/to/docs",
  "created_at": "2025-10-28T10:00:00",
  "updated_at": "2025-10-28T10:00:00",
  "doc_count": 150,
  "description": "BrightSign player documentation"
}
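
When a collection is re-indexed, `doc_count` and `updated_at` need to change while `created_at` stays fixed. A minimal sketch of that bookkeeping follows; `touch_metadata` is a hypothetical helper, not DocRAG's ConfigManager API, and the timestamp format is inferred from the example above.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def touch_metadata(metadata_path: Path, doc_count: int, now: Optional[datetime] = None) -> dict:
    """Update doc_count and updated_at in place, keeping every other field as is."""
    metadata = json.loads(metadata_path.read_text())
    metadata["doc_count"] = doc_count
    # Seconds precision, no timezone, matching the example timestamps.
    stamp = (now or datetime.now()).replace(microsecond=0)
    metadata["updated_at"] = stamp.isoformat()
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metadata.json"
        path.write_text(json.dumps({
            "name": "brightsign",
            "created_at": "2025-10-28T10:00:00",
            "updated_at": "2025-10-28T10:00:00",
            "doc_count": 150,
        }))
        print(touch_metadata(path, doc_count=175))
```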

Development

Project Structure

docrag/
├── docrag/
│   ├── __init__.py
│   ├── cli.py              # CLI commands
│   ├── server.py           # MCP server
│   ├── indexer.py          # Document indexing
│   ├── vectordb.py         # Vector database
│   ├── embeddings.py       # Embeddings
│   ├── config.py           # Configuration
│   └── scrapers/           # Web scrapers
│       ├── __init__.py
│       ├── base.py
│       └── generic.py
├── tests/
├── pyproject.toml
├── README.md
└── DOCRAG_MVP_BUILD_GUIDE.md

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

Code Formatting

# Format with black
black docrag/

# Lint with ruff
ruff check docrag/

Troubleshooting

"DocRAG not initialized"

Run `docrag init` first to create the configuration directory.

"No collections found"

Add a collection with `docrag add <name> --source <path>`.

"Model download fails"

The first time you run DocRAG, it will download the sentence-transformers model (~100MB). Ensure you have internet connectivity.

"Playwright not installed"

If using scrapers, run `playwright install chromium`.

Future Enhancements

Web scraper CLI commands
Support for more file types (PDF, HTML, RST)
Incremental indexing (only index changed files)
Collection activation/deactivation
Collection statistics and health checks
Export/import collections
Cloud sync for collections
Advanced search filters

License

MIT

Author

Ryan - Built for homelab and Claude Code integration