# Documentation Search MCP Server
A Model Context Protocol (MCP) server that provides semantic search over documentation sites. Index any documentation by URL, and search it from Claude Code, Cursor, or any MCP-compatible client.
## Features
- 🔍 Semantic Search: OpenAI embeddings for intelligent documentation search
- 🌐 Auto-Discovery: Automatically finds and parses sitemaps
- 📦 Local Storage: ChromaDB for persistent, local vector storage
- 🎨 Simple GUI: Gradio interface for managing indexed sites
- 🔄 Easy Reindexing: Update documentation with one click
- 🚀 MCP Compatible: Works with Claude Code, Cursor, and other MCP clients
## Installation

### Prerequisites

- Python 3.10 or higher
- An OpenAI API key
### Setup

1. Clone or navigate to the project directory:

   ```bash
   cd docs-mcp-server
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure your OpenAI API key. Create a `.env` file in the project root:

   ```bash
   cp .env.example .env
   ```

   Edit `.env` and add your OpenAI API key:

   ```
   OPENAI_API_KEY=sk-...
   ```
## Usage

### 1. Launch the GUI to Index Documentation

Start the Gradio interface:

```bash
python -m src.gui
```

This opens a web interface at http://127.0.0.1:7860 where you can:

- Add documentation sites by URL
- View indexed sites and statistics
- Reindex existing sites
- Delete sites
#### Example: Indexing the LangGraph docs

1. Go to the "Add Documentation Site" tab
2. Enter the base URL: `https://langchain-ai.github.io/langgraph/`
3. Leave the sitemap URL empty (auto-discovery)
4. Click "Index Site"
The indexer will:
- Find the sitemap automatically
- Crawl all pages
- Convert HTML to Markdown
- Generate embeddings
- Store in local ChromaDB
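The sitemap auto-discovery step can be sketched roughly as follows. This is an illustrative standalone snippet, not the project's actual implementation; the `candidate_sitemap_urls` helper and the list of probe locations are assumptions:

```python
from urllib.parse import urljoin

# Common sitemap locations to probe during auto-discovery (illustrative).
CANDIDATES = ["sitemap.xml", "sitemap_index.xml", "sitemap/sitemap.xml"]

def candidate_sitemap_urls(base_url: str) -> list[str]:
    """Build the list of sitemap URLs to try for a documentation site."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(base, path) for path in CANDIDATES]
```

In practice the indexer would fetch each candidate in order and use the first one that returns valid XML.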
### 2. Configure the MCP Server

#### For Claude Code

Add to your Claude Code MCP settings (`~/.config/claude-code/mcp.json`, or via the Claude Code settings UI):

```json
{
  "mcpServers": {
    "docs-search": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/absolute/path/to/docs-mcp-server",
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```
#### For Cursor

Add to your Cursor MCP settings:

```json
{
  "mcpServers": {
    "docs-search": {
      "command": "python",
      "args": ["-m", "src.server"],
      "cwd": "/absolute/path/to/docs-mcp-server",
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```
### 3. Use the Search Tool

Once configured, you can use the `search_docs` tool in your MCP client.
Example queries:
- "How do I create a state graph in LangGraph?"
- "What are the different types of nodes in LangGraph?"
- "Show me examples of conditional edges"
The tool will return relevant documentation chunks with:
- Source URL
- Similarity score
- Page content
## Project Structure

```
docs-mcp-server/
├── src/
│   ├── __init__.py       # Package initialization
│   ├── server.py         # MCP server implementation
│   ├── indexer.py        # Documentation crawler and indexer
│   ├── embedder.py       # OpenAI embedding generation
│   ├── db.py             # ChromaDB wrapper
│   └── gui.py            # Gradio management interface
├── data/
│   ├── chroma/           # ChromaDB storage (auto-created)
│   └── config.json       # Indexed sites configuration
├── requirements.txt      # Python dependencies
├── .env.example          # Environment variables template
└── README.md             # This file
```
## How It Works

1. **Indexing pipeline:**
   - Discovers the sitemap from the base URL
   - Fetches all pages listed in the sitemap
   - Converts HTML to clean Markdown
   - Splits content into overlapping chunks
   - Generates embeddings using OpenAI
   - Stores chunks in ChromaDB with metadata

2. **Search process:**
   - The user query is embedded using OpenAI
   - ChromaDB performs a cosine similarity search
   - The top results are returned with metadata
   - Results include the source URL and similarity score
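The similarity step above can be illustrated with plain cosine similarity. This is a standalone sketch for intuition; in the server itself, ChromaDB computes this internally:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Ranking search results is then just sorting stored chunk embeddings by this score against the query embedding.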
## Configuration Options

### Indexing Parameters

When adding a site via the GUI or code:

- `base_url`: Main documentation URL (required)
- `sitemap_url`: Custom sitemap URL (optional; auto-discovered if not provided)
- `max_pages`: Limit on the number of pages to index (optional; useful for testing)
### Chunking

Default chunk settings in `indexer.py`:

- `chunk_size`: 1000 characters
- `overlap`: 200 characters

These can be adjusted in the `chunk_text()` method to suit your needs.
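A minimal version of overlapping chunking looks like this. It is a sketch of the technique, not necessarily the project's exact `chunk_text()` implementation:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so sentences cut at a boundary appear in both."""
    chunks = []
    step = chunk_size - overlap  # advance by 800 chars per chunk by default
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap trades a little extra embedding cost for better recall on queries that match text near chunk boundaries.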
### Embedding Model

Default: `text-embedding-3-small` (OpenAI)

To use a different model, modify `embedder.py`:

```python
self.model = "text-embedding-3-large"  # More accurate but more expensive
```
## Troubleshooting

### "No documentation has been indexed yet"

Launch the GUI and add at least one documentation site before using the search tool.

### "Could not find sitemap.xml"

Some sites don't have a sitemap. Provide the sitemap URL manually, or make sure the site has a publicly accessible sitemap.

### "OpenAI API key not found"

Make sure your `.env` file exists and contains a valid `OPENAI_API_KEY`.

### ChromaDB errors

Delete the `data/chroma/` directory to reset the database:

```bash
rm -rf data/chroma/
```

Then reindex your sites.
## Cost Estimation

**OpenAI embedding costs** (`text-embedding-3-small`):

- ~$0.02 per 1M tokens
- Average documentation site (500 pages): ~$0.10-0.50
- Search queries: ~$0.0001 per query
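The per-site estimate follows from simple arithmetic. The tokens-per-page figure below is an assumption (~10k tokens for a long documentation page), chosen to match the low end of the estimate above:

```python
def embedding_cost_usd(pages: int, tokens_per_page: int = 10_000,
                       price_per_million: float = 0.02) -> float:
    """Rough embedding cost for indexing a site: tokens * price per 1M tokens."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * price_per_million
```

For 500 pages this gives about $0.10; denser pages push the total toward the upper end of the range.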
**Storage:**

- ChromaDB is stored locally (no cloud costs)
- Average site: 50-200 MB
## Advanced Usage

### Programmatic Indexing

You can index sites programmatically:

```python
from src.embedder import Embedder
from src.db import DocsDatabase
from src.indexer import DocumentIndexer

embedder = Embedder(api_key="sk-...")
database = DocsDatabase()
indexer = DocumentIndexer(embedder, database)

result = indexer.index_site(
    base_url="https://docs.example.com",
    max_pages=100,  # Optional limit
)

print(f"Indexed {result['pages_indexed']} pages")
```
### Custom Search

```python
from src.embedder import Embedder
from src.db import DocsDatabase

embedder = Embedder()
database = DocsDatabase()

# Embed the query and search across all indexed collections
query_embedding = embedder.embed_text("your query")
results = database.search_all_collections(query_embedding, n_results=10)

for result in results:
    print(f"URL: {result['metadata']['url']}")
    print(f"Content: {result['document'][:200]}...")
```
## Roadmap
- Support for custom embedding models (local transformers)
- Incremental updates (detect changed pages)
- Better HTML parsing for specific doc frameworks
- Export/import indexed data
- REST API for search
- Support for PDF documentation
## Contributing
Contributions welcome! Some ideas:
- Add support for more documentation formats
- Improve HTML to Markdown conversion
- Add more embedding providers
- Enhance the GUI
## License
MIT License - feel free to use and modify!
## Credits
Built with:
- MCP - Model Context Protocol
- ChromaDB - Vector database
- OpenAI - Embeddings
- Gradio - GUI framework
- BeautifulSoup - HTML parsing