Stacklok Documentation Search MCP Server

MCP server for semantic search over Stacklok documentation using vector embeddings.

Features

  • Multiple documentation sources: Supports both websites and GitHub repositories
    • Website crawling with automatic page discovery
    • GitHub repository markdown file fetching with glob pattern matching
    • Configurable via YAML configuration file
  • Robust HTML parsing: Multi-strategy content extraction with fallback handling
  • Markdown processing: Native support for markdown files from GitHub repos
  • Error-resilient: Handles timeouts, 404s, and network errors gracefully with exponential backoff (see the sketch after this list)
  • Rate limiting: Configurable concurrent requests and delays to be respectful of documentation servers
  • Semantic search: Vector-based similarity search using local embeddings
  • Incremental sync: Efficient caching to avoid re-fetching unchanged pages
  • GitHub authentication: Optional token support for higher API rate limits (5000/hour vs 60/hour)
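
For illustration, the retry-with-exponential-backoff pattern looks roughly like this (a minimal sketch using httpx; the function and parameter names are illustrative, not the project's actual API):

import asyncio
import httpx

async def fetch_with_backoff(client: httpx.AsyncClient, url: str,
                             max_retries: int = 3) -> httpx.Response:
    """Retry transient failures, doubling the delay between attempts."""
    for attempt in range(max_retries):
        try:
            response = await client.get(url, timeout=30)
            response.raise_for_status()
            return response
        except (httpx.TransportError, httpx.HTTPStatusError):
            # TransportError covers timeouts and network failures
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            await asyncio.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...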

Quick Start

1. Prerequisites

  • Python 3.13+
  • uv package manager

2. Install Dependencies

uv sync

3. Configuration

Environment Configuration

Copy .env.example to .env:

cp .env.example .env

A small set of environment options is configured in .env; most settings live in sources.yaml.

Available environment variables include:

  • OTEL_ENABLED: Enable/disable OpenTelemetry logging (default: true)
  • OTEL_ENDPOINT: OpenTelemetry collector endpoint (default: http://otel-collector.otel.svc.cluster.local:4318)
  • OTEL_SERVICE_NAME: Service name for telemetry (default: toolhive-doc-mcp)
  • OTEL_SERVICE_VERSION: Service version for telemetry (default: 1.0.0)

Sources Configuration

Copy sources.yaml.example to sources.yaml and customize your documentation sources:

cp sources.yaml.example sources.yaml

The sources.yaml file allows you to configure multiple documentation sources:

sources:
  # Website sources - crawl and extract documentation from websites
  websites:
    - name: "Stacklok Toolhive Docs"
      url: "https://docs.stacklok.com/toolhive"
      path_prefix: "/toolhive"
      enabled: true

  # GitHub repository sources - fetch markdown files from specific repos
  github_repos:
    - name: "Stacklok Toolhive Docs"
      repo_owner: "stacklok"
      repo_name: "toolhive"
      branch: "main"
      paths:
        - "docs/**/*.md"
        - "README.md"
      enabled: true

# Fetching configuration
fetching:
  timeout: 30
  max_retries: 3
  concurrent_limit: 5
  delay_ms: 100
  max_depth: 5

# GitHub API configuration (optional)
github:
  token: null  # Or set GITHUB_TOKEN env var for higher rate limits

4. Build Documentation Index

Run the build process to fetch, parse, chunk, embed, and index all documentation:

uv run python src/build.py

This will:

  1. Load your sources configuration from sources.yaml
  2. Fetch documentation from all enabled website sources
  3. Fetch markdown files from all enabled GitHub repository sources
  4. Parse and chunk all content
  5. Generate embeddings using the local model (downloaded automatically on first run; see the snippet below)
  6. Persist everything to the SQLite vector database

The process displays detailed progress and a summary at the end.
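
Step 5 relies on the fastembed library. For illustration, generating an embedding with the same model looks like this (a standalone snippet, not how src/build.py invokes it internally):

from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # downloaded on first use
vectors = list(model.embed(["What is toolhive?"]))
print(len(vectors[0]))  # a 384-dimensional vector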

5. Start the MCP Server

uv run python src/mcp_server.py

The server will be available at http://localhost:8080.

6. Query the Server

curl -X POST http://localhost:8080/sse \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "query_docs",
      "arguments": {
        "query": "What is toolhive?",
        "limit": 5
      }
    }
  }'
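
The same call can be made from Python with httpx (a sketch; depending on how the server streams responses, you may need an SSE-aware client):

import httpx

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_docs",
        "arguments": {"query": "What is toolhive?", "limit": 5},
    },
}

# httpx sets Content-Type: application/json automatically for json= payloads
response = httpx.post("http://localhost:8080/sse", json=payload)
print(response.text)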

Docker Deployment

The Docker image includes the pre-built documentation database, making it ready to use immediately after building.

Building the Docker Image

Basic Build (without GitHub token)

docker build -t toolhive-doc-mcp:latest .

Note: Without a GitHub token, you may hit API rate limits (60 requests/hour). The build will still work but may fail to fetch some GitHub-based documentation sources.

Build with GitHub Token (Recommended)

For higher rate limits (5,000 requests/hour) and successful fetching of all sources:

  1. Create a GitHub personal access token at https://github.com/settings/tokens (no special scopes needed for public repos)

  2. Build with the token using Docker secrets (secure method):

# Save token to a file
echo "your_token_here" > .github_token

# Build with the secret
docker build --secret id=github_token,src=.github_token -t toolhive-doc-mcp:latest .

# Clean up the token file
rm .github_token

Or use an environment variable with Docker secrets:

export GITHUB_TOKEN=your_token_here
echo "$GITHUB_TOKEN" | docker build --secret id=github_token,src=/dev/stdin -t toolhive-doc-mcp:latest .

Note: This method uses Docker BuildKit secrets, which keeps your token out of the image layers and build history, improving security.
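
Inside the Dockerfile, a BuildKit secret is consumed with a --mount flag along these lines (illustrative; the project's actual Dockerfile may differ):

# db-builder stage: the secret is exposed only for this RUN instruction
RUN --mount=type=secret,id=github_token \
    GITHUB_TOKEN=$(cat /run/secrets/github_token) \
    uv run python src/build.py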

Running the Docker Container

docker run -p 8080:8080 toolhive-doc-mcp:latest

The MCP server will be available at http://localhost:8080

Docker Build Process

The Docker build performs the following steps:

  1. sqlite-vec-builder stage: Compiles the sqlite-vec extension
  2. builder stage: Installs Python dependencies using uv
  3. model-downloader stage: Pre-downloads the embedding model (BAAI/bge-small-en-v1.5)
  4. db-builder stage: Runs src/build.py to:
    • Fetch documentation from all configured sources
    • Parse and chunk the content
    • Generate embeddings
    • Build the SQLite vector database
  5. runner stage: Creates the final minimal image with:
    • The pre-built database
    • The cached embedding model
    • The MCP server

Customizing Documentation Sources

To customize which documentation sources are included in the Docker image:

  1. Edit sources.yaml before building
  2. Enable/disable sources as needed (see the example below)
  3. Build the Docker image with your customized configuration
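
For example, to exclude a website source from the image, set its enabled flag to false before building:

sources:
  websites:
    - name: "Stacklok Toolhive Docs"
      url: "https://docs.stacklok.com/toolhive"
      path_prefix: "/toolhive"
      enabled: false  # this source will be skipped during the build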

Multi-stage Build Benefits

  • Smaller final image: Build dependencies are not included in the runtime image
  • Ready to use: Database is pre-built during image creation
  • Faster startup: No need to download models or build the database at runtime
  • Reproducible: Same image always contains the same documentation snapshot

Development

Run Tests

task test

Code Quality

task format  # Format code
task lint    # Lint code
task typecheck  # Type check

Architecture

Component Overview

  • Website Fetching: httpx async client with retry logic and rate limiting
  • GitHub Integration: GitHub API client with concurrent fetching and authentication support
  • HTML Parsing: BeautifulSoup4 + lxml with multi-strategy content extraction
  • Embeddings: Local fastembed model (BAAI/bge-small-en-v1.5) - no API keys required
  • Vector Store: SQLite + sqlite_vec for vector similarity search (see the example after this list)
  • MCP Server: FastMCP with HTTP/SSE protocol
  • Caching: Filesystem-based HTML cache with JSON metadata
  • Configuration: YAML-based with Pydantic validation
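
For illustration, a similarity query against a sqlite-vec database looks roughly like this (a sketch; the database path, table name, and column layout are illustrative, not the project's actual schema):

import sqlite3
import sqlite_vec

db = sqlite3.connect("docs.db")
db.enable_load_extension(True)
sqlite_vec.load(db)  # load the vector-search extension
db.enable_load_extension(False)

query_vector = [0.0] * 384  # in practice, an embedding from the fastembed model
rows = db.execute(
    "SELECT rowid, distance FROM chunks "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    [sqlite_vec.serialize_float32(query_vector)],
).fetchall()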

Build Process Flow

1. Load sources configuration (sources.yaml)
   ↓
2. Initialize services (embedder, vector store, etc.)
   ↓
3. Fetch from all sources
   ├─ Sync website sources (parallel)
   │  └─ Fetch HTML pages with crawling
   └─ Sync GitHub sources (parallel)
      └─ Fetch markdown files via API
   ↓
4. Parse and chunk all content
   ├─ Parse HTML from websites
   └─ Parse markdown from GitHub
   ↓
5. Generate embeddings (local model)
   ↓
6. Persist to vector database
   ↓
7. Update metadata
   ↓
8. Verify and display summary

Module Structure

sources.yaml (config)
     ↓
src/utils/sources_loader.py (validation)
     ↓
src/build.py (orchestration)
     ├─ src/services/doc_sync.py (websites)
     │   └─ src/services/website_fetcher.py
     │       └─ src/services/html_parser.py
     └─ src/services/github_fetcher.py (GitHub)
         ↓
     src/services/chunker.py
         ↓
     src/services/embedder.py
         ↓
     src/services/vector_store.py

Key Files

Configuration:

  • sources.yaml - Main configuration file for defining documentation sources
  • sources.yaml.example - Example configuration with multiple sources
  • src/models/sources_config.py - Pydantic models for configuration validation
  • src/utils/sources_loader.py - Utility to load and validate configuration

Services:

  • src/services/github_fetcher.py - Service for fetching files from GitHub repositories
  • src/services/doc_sync.py - Website documentation synchronization
  • src/services/website_fetcher.py - HTTP client with retry logic
  • src/services/html_parser.py - Multi-strategy HTML content extraction
  • src/services/chunker.py - Document chunking
  • src/services/embedder.py - Local embedding generation
  • src/services/vector_store.py - SQLite vector database management

Build:

  • src/build.py - Main build orchestration supporting multiple sources
  • src/mcp_server.py - MCP server implementation

Adding New Documentation Sources

Adding a Website Source

Add a new entry to the websites section in sources.yaml:

sources:
  websites:
    - name: "Your Documentation Site"
      url: "https://docs.example.com"
      path_prefix: "/"  # Or specific path like "/docs"
      enabled: true

Adding a GitHub Repository Source

Add a new entry to the github_repos section in sources.yaml:

sources:
  github_repos:
    - name: "Your Project Docs"
      repo_owner: "your-org"
      repo_name: "your-repo"
      branch: "main"  # Optional
      paths:
        - "docs/**/*.md"
        - "*.md"
      enabled: true

After updating sources.yaml, run the build process again to index the new sources.

GitHub Rate Limits

For public repositories, the GitHub API allows 60 requests per hour without authentication. If you need higher limits:

  1. Create a personal access token at https://github.com/settings/tokens
  2. Set it in sources.yaml or as an environment variable:

    export GITHUB_TOKEN=your_token_here

This increases your limit to 5,000 requests per hour.
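
You can check your current quota via GitHub's rate-limit endpoint:

curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit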

Implementation Details

GitHub Integration Features

  • Fetches files using GitHub API with authentication support
  • Supports glob patterns for file matching (e.g., docs/**/*.md)
  • Concurrent file fetching with configurable rate limiting (sketched after this list)
  • Proper error handling and retry logic for network failures
  • Respects GitHub API rate limits
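
For illustration, the semaphore-bounded concurrency pattern looks roughly like this (a sketch, not the project's actual github_fetcher code):

import asyncio
import httpx

async def fetch_files(urls: list[str], concurrent_limit: int = 5) -> list[str]:
    """Fetch raw file contents concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(concurrent_limit)

    async def fetch_one(client: httpx.AsyncClient, url: str) -> str:
        async with semaphore:  # at most concurrent_limit requests in flight
            response = await client.get(url)
            response.raise_for_status()
            return response.text

    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch_one(client, url) for url in urls))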

Configuration Validation

The configuration system uses Pydantic models to validate:

  • Required fields for each source type
  • Valid URLs and repository identifiers
  • Numeric ranges for fetching parameters
  • Proper glob patterns for file matching
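
A simplified sketch of such a model (field names mirror sources.yaml; the actual models live in src/models/sources_config.py and may differ):

from pydantic import BaseModel, HttpUrl

class WebsiteSource(BaseModel):
    name: str
    url: HttpUrl           # malformed URLs are rejected at load time
    path_prefix: str = "/"
    enabled: bool = True

# Raises pydantic.ValidationError if url is not a valid URL
WebsiteSource(name="Example Docs", url="https://docs.example.com")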

Testing

Run the tests to validate the implementation:

uv run pytest tests

Alternatively, use task:

task test

Telemetry

The server includes built-in OpenTelemetry logging that captures query and response data for monitoring and analytics.

Features

  • Automatic query logging: Captures all query_docs and get_chunk calls
  • Rich metadata: Logs query parameters, response metrics, timing information, and errors
  • OpenTelemetry Logs API: Uses OTLP/HTTP logs (not traces) to avoid cardinality issues
  • Low cardinality design: Query text in log body, structured attributes for filtering
  • Configurable: Can be disabled or customized via environment variables

Configuration

Telemetry is enabled by default and sends logs to an OpenTelemetry collector. Configure via environment variables:

# Enable/disable telemetry
OTEL_ENABLED=true

# Collector endpoint (HTTP/protobuf)
OTEL_ENDPOINT=http://otel-collector.otel.svc.cluster.local:4318

# Service identification
OTEL_SERVICE_NAME=toolhive-doc-mcp
OTEL_SERVICE_VERSION=1.0.0

Captured Data

For each query, the telemetry system logs:

In Log Body (high-cardinality, full-text searchable):

  • Query text (the actual search query)
  • Chunk IDs and summary statistics

In Structured Attributes (low-cardinality, filterable):

  • Tool name, timestamp, and query parameters (limit, query_type, min_score)
  • Response metrics (result count, top score, query time, response size)
  • Error information (error type and message)

This design prevents cardinality explosion in metrics/trace backends while enabling full-text search in log aggregation systems (Loki, Elasticsearch, etc.).
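
Conceptually, a single query log record might have this shape (illustrative field names, not the exact schema the server emits):

{
  "body": "query_docs: 'What is toolhive?' -> 5 chunks returned",
  "attributes": {
    "tool.name": "query_docs",
    "query.limit": 5,
    "query.min_score": null,
    "response.result_count": 5,
    "response.top_score": 0.87,
    "response.query_time_ms": 42
  }
}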

Disabling Telemetry

To disable telemetry, set OTEL_ENABLED=false in your environment or .env file.

For detailed telemetry documentation, see .

Dependencies

Key dependencies:

  • httpx - HTTP client with retry support
  • BeautifulSoup4 + lxml - HTML parsing
  • aiohttp - Async HTTP operations
  • fastembed - Local embeddings (no API required)
  • sqlite-vec - Vector similarity search
  • pydantic - Configuration validation
  • pyyaml - YAML configuration parsing
  • opentelemetry-* - OpenTelemetry logging (OTLP/HTTP)

Future Enhancements

Possible improvements:

  • Support for other source types (GitLab, Bitbucket, etc.)
  • Selective re-indexing of specific sources
  • Source-specific search filtering
  • Automatic source discovery
  • Webhook-based incremental updates
  • Source-level metadata and tagging