All Docs MCP

A smart documentation crawler and search system that automatically indexes documentation websites and makes them searchable through an MCP server.

🚀 Quick Start

1. Setup Environment

# Clone and setup
git clone <your-repo>
cd all-docs-mcp

# Install dependencies
cd docs_manager
uv sync

2. Configure Environment

Create a .env file in the root directory:

PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX_NAME=docs-mcp
OPENAI_API_KEY=your_openai_api_key
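
For reference, this is a minimal sketch of how a script can pick these values up at runtime, assuming python-dotenv is used; the project's own scripts may load the environment differently:

# Hypothetical loading snippet (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory
pinecone_key = os.environ["PINECONE_API_KEY"]
index_name = os.environ.get("PINECONE_INDEX_NAME", "docs-mcp")
openai_key = os.environ["OPENAI_API_KEY"]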

3. Crawl Documentation

cd docs_manager
python3 populate_db.py --library_name pandas --url https://pandas.pydata.org/docs/

4. Start the MCP Server

cd mcp_server
uv sync
python3 main.py

5. Use the Streamlit App

# In the root directory
streamlit run app.py

6. Test with MCP Inspector (Optional)

The MCP Inspector is a powerful tool for testing and debugging MCP servers. Use it to:

  • Test search functionality directly
  • Debug server responses
  • Explore available tools and resources
  • Monitor server logs

# Test the MCP server directly
npx @modelcontextprotocol/inspector \
  uv \
  --directory mcp_server \
  run \
  mcp-server \
  --port 8000

Inspector Features:

  • Resources Tab: Browse indexed documentation
  • Tools Tab: Test search functionality
  • Prompts Tab: Try different search queries
  • Notifications: Monitor server logs and errors

🔄 System Flow

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Documentation │    │   Crawl4AI       │    │   Pinecone      │
│   Website       │───▶│   Deep Crawler   │───▶│   Vector DB     │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │   OpenAI         │    │   MCP Server    │
                       │   Embeddings     │    │   (Search API)  │
                       └──────────────────┘    └─────────────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │   Streamlit     │
                                                │   UI            │
                                                └─────────────────┘
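
The same flow in code form, as an illustrative sketch only: it assumes crawl4ai's AsyncWebCrawler, OpenAI's text-embedding-3-small model, and the Pinecone Python client, and it is not the actual populate_db.py:

# Rough sketch of the crawl -> embed -> upsert path (not the project's actual code)
import asyncio
import os

from crawl4ai import AsyncWebCrawler
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(
    os.environ.get("PINECONE_INDEX_NAME", "docs-mcp")
)

async def index_page(url: str, library: str) -> None:
    # 1. Crawl the page and convert it to markdown
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
    text = str(result.markdown)

    # 2. Embed the content with OpenAI
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[text]
    ).data[0].embedding

    # 3. Upsert into Pinecone, one namespace per library
    index.upsert(
        vectors=[{"id": url, "values": embedding,
                  "metadata": {"url": url, "text": text[:1000]}}],
        namespace=library,
    )

asyncio.run(index_page("https://pandas.pydata.org/docs/", "pandas"))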

πŸ“ Project Structure

all-docs-mcp/
├── docs_manager/          # Crawler and indexing service
│   ├── populate_db.py     # Main crawler script
│   ├── vector_store.py    # Pinecone integration
│   └── pyproject.toml     # Dependencies
├── mcp_server/           # MCP server for search
│   ├── main.py           # Server entry point
│   └── pyproject.toml    # Dependencies
├── app.py                # Streamlit UI
└── README.md             # This file

🎯 Features

  • Smart Crawling: Uses crawl4ai to deeply crawl documentation sites
  • Content Filtering: Automatically skips gallery/index pages
  • Vector Search: Stores content in Pinecone for semantic search
  • MCP Integration: Provides search through the MCP protocol (see the sketch after this list)
  • Web UI: Streamlit interface for testing and exploration
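
A minimal sketch of what such an MCP search tool can look like, assuming the Python MCP SDK's FastMCP helper and a Pinecone-backed query; the tool name, parameters, and embedding model here are assumptions, and the real mcp_server/main.py may differ:

# Hypothetical MCP search tool (not the project's actual server code)
import os
from mcp.server.fastmcp import FastMCP
from openai import OpenAI
from pinecone import Pinecone

mcp = FastMCP("all-docs-mcp")
openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(
    os.environ.get("PINECONE_INDEX_NAME", "docs-mcp")
)

@mcp.tool()
def search_docs(query: str, library: str, top_k: int = 5) -> list[str]:
    """Semantic search over one library's indexed documentation."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding
    hits = index.query(vector=embedding, top_k=top_k,
                       namespace=library, include_metadata=True)
    return [(m.metadata or {}).get("text", "") for m in hits.matches]

if __name__ == "__main__":
    mcp.run()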

🔧 Configuration

Crawler Settings

  • Max Depth: 20 levels
  • Max Pages: 10,000 pages per library
  • Word Threshold: 50 words minimum
  • Chunk Size: 1000 characters
  • Overlap: 200 characters (see the chunking sketch after this list)
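
The chunk size and overlap correspond to a simple sliding-window split over the page text. A pure-Python sketch of that idea (the crawler's actual chunking code may differ):

# Illustrative chunker: 1000-character windows with 200-character overlap
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if len(chunk.split()) >= 50:  # mirrors the 50-word minimum threshold
            chunks.append(chunk)
    return chunks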

πŸ› Troubleshooting

"No documents found": Check if the library name matches the namespace in Pinecone "Metadata size exceeds limit": Content is automatically truncated "Crawl stopped early": Check the URL and site accessibility

MCP Inspector Issues

"Server connection failed":

  • Ensure the MCP server is running (python3 main.py in mcp_server/)
  • Check if port 8000 is available
  • Verify environment variables are set

"No tools available":

  • Make sure you've crawled some documentation first
  • Check that the Pinecone index has data in the correct namespace
  • Verify the server is properly initialized

"Search returns no results":

  • Test with a simple query like "documentation" or "help"
  • Check the library name matches your crawled data
  • Verify the Pinecone index contains the expected data (see the check below)
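
To confirm that the index actually holds vectors for your library, a quick check with the Pinecone client can help; this is a sketch, and the namespace names follow whatever --library_name values you crawled with:

# Quick sanity check of the Pinecone namespaces
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ.get("PINECONE_INDEX_NAME", "docs-mcp"))

stats = index.describe_index_stats()
print(stats.namespaces)  # e.g. {'pandas': {'vector_count': 1234}, ...}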

πŸ“ Example Usage

# Crawl pandas documentation (run these from docs_manager/)
python3 populate_db.py --library_name pandas --url https://pandas.pydata.org/docs/

# Crawl matplotlib documentation
python3 populate_db.py --library_name matplotlib --url https://matplotlib.org/stable/

# Search through the UI
streamlit run app.py
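
Beyond the Streamlit UI, the server can also be exercised programmatically with the MCP Python SDK. A rough sketch, assuming a stdio connection and a hypothetical search_docs tool name (use list_tools to discover the server's real tool names):

# Hypothetical MCP client session against the local server
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    params = StdioServerParameters(command="python3", args=["mcp_server/main.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()          # discover the real tool names
            print([t.name for t in tools.tools])
            result = await session.call_tool(
                "search_docs", {"query": "read a CSV file", "library": "pandas"}
            )
            print(result.content)

asyncio.run(main())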