StacklokLabs/toolhive-doc-mcp
Stacklok Documentation Search MCP Server
MCP server for semantic search over Stacklok documentation using vector embeddings.
Features
- Multiple documentation sources: Supports both websites and GitHub repositories
- Website crawling with automatic page discovery
- GitHub repository markdown file fetching with glob pattern matching
- Configurable via YAML configuration file
- Robust HTML parsing: Multi-strategy content extraction with fallback handling
- Markdown processing: Native support for markdown files from GitHub repos
- Error-resilient: Handles timeouts, 404s, and network errors gracefully with exponential backoff
- Rate limiting: Configurable concurrent requests and delays to be respectful of documentation servers
- Semantic search: Vector-based similarity search using local embeddings
- Incremental sync: Efficient caching to avoid re-fetching unchanged pages
- GitHub authentication: Optional token support for higher API rate limits (5000/hour vs 60/hour)
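The incremental sync listed above works by skipping pages whose content is unchanged since the last run. A minimal sketch of that idea using a content hash (the cache shape and function names here are illustrative, not the project's actual API):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a fetched page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(url: str, text: str, cache: dict[str, str]) -> bool:
    """Return True if the page changed since the last sync, updating the cache."""
    digest = content_hash(text)
    if cache.get(url) == digest:
        return False  # unchanged: skip parsing, chunking, and embedding
    cache[url] = digest
    return True
```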
Quick Start
1. Prerequisites
- Python 3.13+
- uv package manager
2. Install Dependencies
uv sync
3. Configuration
Environment Configuration
Copy .env.example to .env:
cp .env.example .env
Key environment configuration options are available in .env, but most settings are now in sources.yaml.
Available environment variables include:
- `OTEL_ENABLED`: Enable/disable OpenTelemetry logging (default: `true`)
- `OTEL_ENDPOINT`: OpenTelemetry collector endpoint (default: `http://otel-collector.otel.svc.cluster.local:4318`)
- `OTEL_SERVICE_NAME`: Service name for telemetry (default: `toolhive-doc-mcp`)
- `OTEL_SERVICE_VERSION`: Service version for telemetry (default: `1.0.0`)
Sources Configuration
Copy sources.yaml.example to sources.yaml and customize your documentation sources:
cp sources.yaml.example sources.yaml
The sources.yaml file allows you to configure multiple documentation sources:
sources:
  # Website sources - crawl and extract documentation from websites
  websites:
    - name: "Stacklok Toolhive Docs"
      url: "https://docs.stacklok.com/toolhive"
      path_prefix: "/toolhive"
      enabled: true

  # GitHub repository sources - fetch markdown files from specific repos
  github_repos:
    - name: "Stacklok Toolhive Docs"
      repo_owner: "stacklok"
      repo_name: "toolhive"
      branch: "main"
      paths:
        - "docs/**/*.md"
        - "README.md"
      enabled: true

# Fetching configuration
fetching:
  timeout: 30
  max_retries: 3
  concurrent_limit: 5
  delay_ms: 100
  max_depth: 5

# GitHub API configuration (optional)
github:
  token: null  # Or set GITHUB_TOKEN env var for higher rate limits
4. Build Documentation Index
Run the build process to fetch, parse, chunk, embed, and index all documentation:
uv run python src/build.py
This will:
- Load your sources configuration from sources.yaml
- Fetch documentation from all enabled website sources
- Fetch markdown files from all enabled GitHub repository sources
- Parse and chunk all content
- Generate embeddings using the local model (downloaded automatically on first run)
- Persist everything to the SQLite vector database
The process displays detailed progress and a summary at the end.
5. Start the MCP Server
uv run python src/mcp_server.py
The server will be available at: http://localhost:8080
6. Query the Server
curl -X POST http://localhost:8080/sse \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "query_docs",
      "arguments": {
        "query": "What is toolhive?",
        "limit": 5
      }
    }
  }'
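The same tools/call request can be issued from Python. A minimal sketch using only the standard library, with the endpoint and tool name taken from the curl example above (posting requires a running server):

```python
import json
import urllib.request

# The JSON-RPC 2.0 request from the curl example.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_docs",
        "arguments": {"query": "What is toolhive?", "limit": 5},
    },
}

def post_query(url: str = "http://localhost:8080/sse") -> bytes:
    """POST the request to the MCP server and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```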
Docker Deployment
The Docker image includes the pre-built documentation database, making it ready to use immediately after building.
Building the Docker Image
Basic Build (without GitHub token)
docker build -t toolhive-doc-mcp:latest .
Note: Without a GitHub token, you may hit API rate limits (60 requests/hour). The build will still work but may fail to fetch some GitHub-based documentation sources.
Build with GitHub Token (Recommended)
For higher rate limits (5,000 requests/hour) and successful fetching of all sources:
1. Create a GitHub personal access token at https://github.com/settings/tokens (no special scopes needed for public repos)
2. Build with the token using Docker secrets (secure method):
# Save token to a file
echo "your_token_here" > .github_token
# Build with the secret
docker build --secret id=github_token,src=.github_token -t toolhive-doc-mcp:latest .
# Clean up the token file
rm .github_token
Or use an environment variable with Docker secrets:
export GITHUB_TOKEN=your_token_here
echo "$GITHUB_TOKEN" | docker build --secret id=github_token,src=/dev/stdin -t toolhive-doc-mcp:latest .
Note: This method uses Docker BuildKit secrets, which keeps your token out of the image layers and build history, improving security.
Running the Docker Container
docker run -p 8080:8080 toolhive-doc-mcp:latest
The MCP server will be available at http://localhost:8080
Docker Build Process
The Docker build performs the following steps:
- sqlite-vec-builder stage: Compiles the sqlite-vec extension
- builder stage: Installs Python dependencies using uv
- model-downloader stage: Pre-downloads the embedding model (BAAI/bge-small-en-v1.5)
- db-builder stage: Runs src/build.py to:
  - Fetch documentation from all configured sources
  - Parse and chunk the content
  - Generate embeddings
  - Build the SQLite vector database
- runner stage: Creates the final minimal image with:
  - The pre-built database
  - The cached embedding model
  - The MCP server
Customizing Documentation Sources
To customize which documentation sources are included in the Docker image:
- Edit sources.yaml before building
- Enable/disable sources as needed
- Build the Docker image with your customized configuration
Multi-stage Build Benefits
- Smaller final image: Build dependencies are not included in the runtime image
- Ready to use: Database is pre-built during image creation
- Faster startup: No need to download models or build database at runtime
- Reproducible: Same image always contains the same documentation snapshot
Development
Run Tests
task test
Code Quality
task format # Format code
task lint # Lint code
task typecheck # Type check
Architecture
Component Overview
- Website Fetching: httpx async client with retry logic and rate limiting
- GitHub Integration: GitHub API client with concurrent fetching and authentication support
- HTML Parsing: BeautifulSoup4 + lxml with multi-strategy content extraction
- Embeddings: Local fastembed model (BAAI/bge-small-en-v1.5) - no API keys required
- Vector Store: SQLite + sqlite_vec for vector similarity search
- MCP Server: FastMCP with HTTP/SSE protocol
- Caching: Filesystem-based HTML cache with JSON metadata
- Configuration: YAML-based with Pydantic validation
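As an illustration of the chunking component, here is a simple word-window chunker with overlap. This is a sketch only; the real src/services/chunker.py may use different sizes and token-aware splitting:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows for embedding."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already reaches the end of the text
    return chunks
```

Overlap preserves context across chunk boundaries so a sentence split between windows still appears whole in at least one chunk.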
Build Process Flow
1. Load sources configuration (sources.yaml)
↓
2. Initialize services (embedder, vector store, etc.)
↓
3. Fetch from all sources
├─ Sync website sources (parallel)
│ └─ Fetch HTML pages with crawling
└─ Sync GitHub sources (parallel)
└─ Fetch markdown files via API
↓
4. Parse and chunk all content
├─ Parse HTML from websites
└─ Parse markdown from GitHub
↓
5. Generate embeddings (local model)
↓
6. Persist to vector database
↓
7. Update metadata
↓
8. Verify and display summary
Module Structure
sources.yaml (config)
↓
src/utils/sources_loader.py (validation)
↓
src/build.py (orchestration)
├─ src/services/doc_sync.py (websites)
│ └─ src/services/website_fetcher.py
│ └─ src/services/html_parser.py
└─ src/services/github_fetcher.py (GitHub)
↓
src/services/chunker.py
↓
src/services/embedder.py
↓
src/services/vector_store.py
Key Files
Configuration:
- sources.yaml - Main configuration file for defining documentation sources
- sources.yaml.example - Example configuration with multiple sources
- src/models/sources_config.py - Pydantic models for configuration validation
- src/utils/sources_loader.py - Utility to load and validate configuration
Services:
- src/services/github_fetcher.py - Service for fetching files from GitHub repositories
- src/services/doc_sync.py - Website documentation synchronization
- src/services/website_fetcher.py - HTTP client with retry logic
- src/services/html_parser.py - Multi-strategy HTML content extraction
- src/services/chunker.py - Document chunking
- src/services/embedder.py - Local embedding generation
- src/services/vector_store.py - SQLite vector database management
Build:
- src/build.py - Main build orchestration supporting multiple sources
- src/mcp_server.py - MCP server implementation
Adding New Documentation Sources
Adding a Website Source
Add a new entry to the websites section in sources.yaml:
sources:
  websites:
    - name: "Your Documentation Site"
      url: "https://docs.example.com"
      path_prefix: "/"  # Or specific path like "/docs"
      enabled: true
Adding a GitHub Repository Source
Add a new entry to the github_repos section in sources.yaml:
sources:
  github_repos:
    - name: "Your Project Docs"
      repo_owner: "your-org"
      repo_name: "your-repo"
      branch: "main"  # Optional
      paths:
        - "docs/**/*.md"
        - "*.md"
      enabled: true
After updating sources.yaml, run the build process again to index the new sources.
GitHub Rate Limits
For public repositories, the GitHub API allows 60 requests per hour without authentication. If you need higher limits:
- Create a personal access token at https://github.com/settings/tokens
- Set it in sources.yaml or as an environment variable: `export GITHUB_TOKEN=your_token_here`
This increases your limit to 5,000 requests per hour.
Implementation Details
GitHub Integration Features
- Fetches files using GitHub API with authentication support
- Supports glob patterns for file matching (e.g., docs/**/*.md)
- Concurrent file fetching with configurable rate limiting
- Proper error handling and retry logic for network failures
- Respects GitHub API rate limits
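The glob matching can be approximated with the standard library's fnmatch. A sketch only: fnmatch's `*` also crosses `/` separators, and `docs/**/*.md` here requires at least one subdirectory, so the project's actual matcher may differ in edge cases:

```python
from fnmatch import fnmatch

def matches_any(path: str, patterns: list[str]) -> bool:
    """Return True if a repo-relative path matches any configured glob."""
    return any(fnmatch(path, pattern) for pattern in patterns)

# Patterns as they would appear under `paths:` in sources.yaml.
PATTERNS = ["docs/**/*.md", "README.md"]
```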
Configuration Validation
The configuration system uses Pydantic models to validate:
- Required fields for each source type
- Valid URLs and repository identifiers
- Numeric ranges for fetching parameters
- Proper glob patterns for file matching
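The project performs these checks with Pydantic models in src/models/sources_config.py; to show what the validation covers, here is the same idea sketched with stdlib dataclasses (field names follow the sources.yaml example above):

```python
from dataclasses import dataclass, field

@dataclass
class GitHubRepoSource:
    """Mirrors one entry under sources.github_repos."""
    name: str
    repo_owner: str
    repo_name: str
    branch: str = "main"
    paths: list[str] = field(default_factory=lambda: ["**/*.md"])
    enabled: bool = True

    def __post_init__(self) -> None:
        # Required fields must be non-empty.
        for attr in ("name", "repo_owner", "repo_name"):
            if not getattr(self, attr):
                raise ValueError(f"{attr} is required")
        # Repository identifiers are bare names, never owner/name paths.
        if "/" in self.repo_owner or "/" in self.repo_name:
            raise ValueError("repo_owner and repo_name must not contain '/'")
```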
Testing
Run the tests to validate the implementation:
uv run pytest tests
Alternatively, use task:
task test
Telemetry
The server includes built-in OpenTelemetry logging that captures query and response data for monitoring and analytics.
Features
- Automatic query logging: Captures all query_docs and get_chunk calls
- Rich metadata: Logs query parameters, response metrics, timing information, and errors
- OpenTelemetry Logs API: Uses OTLP/HTTP logs (not traces) to avoid cardinality issues
- Low cardinality design: Query text in log body, structured attributes for filtering
- Configurable: Can be disabled or customized via environment variables
Configuration
Telemetry is enabled by default and sends logs to an OpenTelemetry collector. Configure via environment variables:
# Enable/disable telemetry
OTEL_ENABLED=true
# Collector endpoint (HTTP/protobuf)
OTEL_ENDPOINT=http://otel-collector.otel.svc.cluster.local:4318
# Service identification
OTEL_SERVICE_NAME=toolhive-doc-mcp
OTEL_SERVICE_VERSION=1.0.0
Captured Data
For each query, the telemetry system logs:
In Log Body (high-cardinality, full-text searchable):
- Query text (the actual search query)
- Chunk IDs and summary statistics
In Structured Attributes (low-cardinality, filterable):
- Tool name, timestamp, and query parameters (limit, query_type, min_score)
- Response metrics (result count, top score, query time, response size)
- Error information (error type and message)
This design prevents cardinality explosion in metrics/trace backends while enabling full-text search in log aggregation systems (Loki, Elasticsearch, etc.).
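The body/attributes split can be illustrated with a plain structure. This sketch shows what goes where; the field names are illustrative, not the server's exact schema:

```python
import time

def build_log_record(query: str, chunk_ids: list[str], limit: int,
                     top_score: float, query_ms: float) -> dict:
    """Assemble a log record: high-cardinality text in the body,
    low-cardinality fields as filterable attributes."""
    return {
        "body": {
            "query": query,          # full-text searchable in Loki/Elasticsearch
            "chunk_ids": chunk_ids,  # per-query detail, too unique for labels
        },
        "attributes": {
            "tool.name": "query_docs",
            "query.limit": limit,
            "response.result_count": len(chunk_ids),
            "response.top_score": top_score,
            "response.query_ms": query_ms,
            "timestamp": time.time(),
        },
    }
```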
Disabling Telemetry
To disable telemetry, set OTEL_ENABLED=false in your environment or .env file.
For detailed telemetry documentation, see .
Dependencies
Key dependencies:
- httpx - HTTP client with retry support
- BeautifulSoup4 + lxml - HTML parsing
- aiohttp - Async HTTP operations
- fastembed - Local embeddings (no API required)
- sqlite-vec - Vector similarity search
- pydantic - Configuration validation
- pyyaml - YAML configuration parsing
- opentelemetry-* - OpenTelemetry logging (OTLP/HTTP)
Future Enhancements
Possible improvements:
- Support for other source types (GitLab, Bitbucket, etc.)
- Selective re-indexing of specific sources
- Source-specific search filtering
- Automatic source discovery
- Webhook-based incremental updates
- Source-level metadata and tagging