# All Docs MCP
A smart documentation crawler and search system that automatically indexes documentation websites and makes them searchable through an MCP server.
## Quick Start

### 1. Setup Environment
```bash
# Clone and setup
git clone <your-repo>
cd all-docs-mcp

# Install dependencies
cd docs_manager
uv sync
```
### 2. Configure Environment

Create a `.env` file in the root directory:
```env
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX_NAME=docs-mcp
OPENAI_API_KEY=your_openai_api_key
```
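The crawler and server presumably read these keys from the environment. To confirm they are visible before starting a long crawl, a quick check like the one below works (it assumes `python-dotenv`, which is not a stated dependency of this repo):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # pull the root .env file into the process environment

# Fail fast if any required key is missing
for key in ("PINECONE_API_KEY", "PINECONE_INDEX_NAME", "OPENAI_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"missing environment variable: {key}")
```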
### 3. Crawl Documentation

```bash
cd docs_manager
python3 populate_db.py --library_name pandas --url https://pandas.pydata.org/docs/
```
### 4. Start the MCP Server

```bash
cd mcp_server
uv sync
python3 main.py
```
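`main.py` isn't reproduced here, but to give a sense of the server's shape, below is a minimal sketch of an MCP search tool built with the official Python SDK's `FastMCP` helper. The tool name `search_docs`, the embedding model, and the Pinecone wiring are illustrative assumptions, not the repo's actual implementation:

```python
import os

from mcp.server.fastmcp import FastMCP
from openai import OpenAI
from pinecone import Pinecone

mcp = FastMCP("all-docs-mcp")  # server name is illustrative
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(
    os.environ.get("PINECONE_INDEX_NAME", "docs-mcp")
)


@mcp.tool()
def search_docs(query: str, library: str, top_k: int = 5) -> list[str]:
    """Semantic search over crawled docs; `library` selects the Pinecone namespace."""
    # Embed the query with the same model assumed at indexing time
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed model, not stated in the README
        input=query,
    ).data[0].embedding
    results = index.query(
        vector=embedding, top_k=top_k, namespace=library, include_metadata=True
    )
    return [match.metadata.get("text", "") for match in results.matches]


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```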
### 5. Use the Streamlit App

```bash
# In the root directory
streamlit run app.py
```
### 6. Test with MCP Inspector (Optional)
The MCP Inspector is a powerful tool for testing and debugging MCP servers. Use it to:
- Test search functionality directly
- Debug server responses
- Explore available tools and resources
- Monitor server logs
```bash
# Test the MCP server directly
npx @modelcontextprotocol/inspector \
  uv \
  --directory mcp_server \
  run \
  mcp-server \
  --port 8000
```
Inspector Features:
- Resources Tab: Browse indexed documentation
- Tools Tab: Test search functionality
- Prompts Tab: Try different search queries
- Notifications: Monitor server logs and errors
## System Flow
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Documentation  │     │    Crawl4AI     │     │    Pinecone     │
│     Website     │────▶│  Deep Crawler   │────▶│    Vector DB    │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                        ┌─────────────────┐      ┌─────────────────┐
                        │     OpenAI      │      │   MCP Server    │
                        │   Embeddings    │      │  (Search API)   │
                        └─────────────────┘      └─────────────────┘
                                                          │
                                                          ▼
                                                 ┌─────────────────┐
                                                 │    Streamlit    │
                                                 │       UI        │
                                                 └─────────────────┘
```
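In code, the left-to-right flow boils down to: crawl a page, chunk and embed its text with OpenAI, then upsert the vectors into Pinecone under the library's namespace. A condensed sketch of that indexing step follows (the real logic lives in `docs_manager/populate_db.py` and `vector_store.py`; the helper name, embedding model, and metadata fields are assumptions):

```python
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(
    os.environ.get("PINECONE_INDEX_NAME", "docs-mcp")
)


def index_chunks(chunks: list[str], library: str, page_url: str) -> None:
    """Embed text chunks and store them in the library's Pinecone namespace."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=chunks,
    )
    index.upsert(
        vectors=[
            {
                "id": f"{page_url}#{i}",  # one vector per chunk of the page
                "values": item.embedding,
                "metadata": {"text": chunk, "url": page_url},
            }
            for i, (chunk, item) in enumerate(zip(chunks, response.data))
        ],
        namespace=library,  # e.g. "pandas", matching --library_name
    )
```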
## Project Structure

```
all-docs-mcp/
├── docs_manager/          # Crawler and indexing service
│   ├── populate_db.py     # Main crawler script
│   ├── vector_store.py    # Pinecone integration
│   └── pyproject.toml     # Dependencies
├── mcp_server/            # MCP server for search
│   ├── main.py            # Server entry point
│   └── pyproject.toml     # Dependencies
├── app.py                 # Streamlit UI
└── README.md              # This file
```
## Features
- Smart Crawling: Uses crawl4ai to deeply crawl documentation sites
- Content Filtering: Automatically skips gallery/index pages
- Vector Search: Stores content in Pinecone for semantic search
- MCP Integration: Provides search through MCP protocol
- Web UI: Streamlit interface for testing and exploration
## Configuration

### Crawler Settings
- Max Depth: 20 levels deep
- Max Pages: 10,000 pages per library
- Word Threshold: 50 words minimum
- Chunk Size: 1000 characters
- Overlap: 200 characters
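The chunk size and overlap describe a simple sliding window over each page's text; a sketch of that behavior (the actual splitter in `docs_manager` may differ):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into ~1000-character chunks overlapping by 200 characters,
    so content cut at a chunk boundary still appears whole in one chunk."""
    step = chunk_size - overlap  # advance 800 characters per chunk
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]
```

The overlap means a sentence split at the 1000-character mark still shows up intact in the next chunk, at the cost of roughly 25% extra stored text (1000/800 characters indexed per character of input).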
## Troubleshooting

- "No documents found": Check if the library name matches the namespace in Pinecone
- "Metadata size exceeds limit": Content is automatically truncated
- "Crawl stopped early": Check the URL and site accessibility
### MCP Inspector Issues

"Server connection failed":
- Ensure the MCP server is running (`python3 main.py` in `mcp_server/`)
- Check if port 8000 is available
- Verify environment variables are set
"No tools available":
- Make sure you've crawled some documentation first
- Check Pinecone index has data in the correct namespace
- Verify the server is properly initialized
"Search returns no results":
- Test with a simple query like "documentation" or "help"
- Check the library name matches your crawled data
- Verify the Pinecone index contains the expected data
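For the last two checks, it helps to confirm the index actually holds vectors in the namespace you're querying. The stats call below is a standard Pinecone client method; the index name falls back to `docs-mcp` as in the `.env` example:

```python
import os

from pinecone import Pinecone

index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(
    os.environ.get("PINECONE_INDEX_NAME", "docs-mcp")
)

# One namespace per crawled library; vector_count should be non-zero
stats = index.describe_index_stats()
for namespace, summary in stats.namespaces.items():
    print(namespace, summary.vector_count)
```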
## Example Usage

```bash
# Crawl pandas documentation
python3 populate_db.py --library_name pandas --url https://pandas.pydata.org/docs/

# Crawl matplotlib documentation
python3 populate_db.py --library_name matplotlib --url https://matplotlib.org/stable/

# Search through the UI
streamlit run app.py
```