
All Docs MCP

A smart documentation crawler and search system that automatically indexes documentation websites and makes them searchable through an MCP server.

🚀 Quick Start

1. Setup Environment

# Clone and setup
git clone <your-repo>
cd all-docs-mcp

# Install dependencies
cd docs_manager
uv sync

2. Configure Environment

Create a .env file in the root directory:

PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX_NAME=docs-mcp
OPENAI_API_KEY=your_openai_api_key
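
The scripts are expected to read these values from the environment at startup. If you want to confirm the .env file is being picked up, a minimal check using python-dotenv (an assumption about how configuration is loaded, not the project's own code) looks like this:

# check_env.py — hypothetical helper to verify .env values are visible
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

for key in ("PINECONE_API_KEY", "PINECONE_INDEX_NAME", "OPENAI_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")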

3. Crawl Documentation

cd docs_manager
python3 populate_db.py --library_name pandas --url https://pandas.pydata.org/docs/

4. Start the MCP Server

cd mcp_server
uv sync
python3 main.py
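
For orientation, here is a minimal sketch of what an MCP search tool can look like using the official Python SDK's FastMCP class. The tool name and parameters are illustrative assumptions, not the actual contents of main.py:

# server_sketch.py — illustrative only; the real entry point is mcp_server/main.py
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("all-docs-mcp")

@mcp.tool()
def search_docs(query: str, library: str) -> str:
    """Search indexed documentation for a library (hypothetical tool)."""
    # A real implementation would embed the query and run a Pinecone
    # similarity search in the library's namespace.
    return f"Results for {query!r} in {library!r} would appear here."

if __name__ == "__main__":
    mcp.run()  # stdio transport by default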

5. Use the Streamlit App

# In the root directory
streamlit run app.py
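
Conceptually, the UI is a text input wired to the search backend. A minimal hedged sketch of such a page (widget labels and wiring are assumptions, not the actual app.py):

# ui_sketch.py — illustrative Streamlit page
import streamlit as st

st.title("All Docs MCP Search")
library = st.text_input("Library name", value="pandas")
query = st.text_input("Search query")

if query:
    # A real app would call the search backend (MCP server / Pinecone) here.
    st.write(f"Showing results for {query!r} in {library!r}...")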

6. Test with MCP Inspector (Optional)

The MCP Inspector is a powerful tool for testing and debugging MCP servers. Use it to:

  • Test search functionality directly
  • Debug server responses
  • Explore available tools and resources
  • Monitor server logs

# Test the MCP server directly
npx @modelcontextprotocol/inspector \
  uv \
  --directory mcp_server \
  run \
  mcp-server \
  --port 8000

Inspector Features:

  • Resources Tab: Browse indexed documentation
  • Tools Tab: Test search functionality
  • Prompts Tab: Try different search queries
  • Notifications: Monitor server logs and errors

🔄 System Flow

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Documentation │    │   Crawl4AI       │    │   Pinecone      │
│   Website       │───▶│   Deep Crawler   │───▶│   Vector DB     │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │   OpenAI         │    │   MCP Server    │
                       │   Embeddings     │    │   (Search API)  │
                       └──────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
                                               ┌─────────────────┐
                                               │   Streamlit     │
                                               │   UI            │
                                               └─────────────────┘

📁 Project Structure

all-docs-mcp/
├── docs_manager/          # Crawler and indexing service
│   ├── populate_db.py     # Main crawler script
│   ├── vector_store.py    # Pinecone integration
│   └── pyproject.toml     # Dependencies
├── mcp_server/            # MCP server for search
│   ├── main.py            # Server entry point
│   └── pyproject.toml     # Dependencies
├── app.py                 # Streamlit UI
└── README.md              # This file

🎯 Features

  • Smart Crawling: Uses crawl4ai to deeply crawl documentation sites
  • Content Filtering: Automatically skips gallery/index pages
  • Vector Search: Stores content in Pinecone for semantic search (see the search sketch below)
  • MCP Integration: Provides search through MCP protocol
  • Web UI: Streamlit interface for testing and exploration
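
The vector-search path boils down to two calls: embed the query with OpenAI, then query Pinecone in the library's namespace. A minimal sketch assuming the current Pinecone client and the text-embedding-3-small model (both the model choice and the metadata field names are assumptions):

# search_sketch.py — hedged example of the semantic-search step
import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # uses OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

def search(query: str, library: str, top_k: int = 5):
    # Embed the query (model name is an assumption).
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # Each library lives in its own namespace (see Troubleshooting).
    return index.query(
        vector=embedding, top_k=top_k, namespace=library, include_metadata=True
    )

for match in search("read a CSV file", "pandas").matches:
    print(round(match.score, 3), (match.metadata or {}).get("text", "")[:80])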

🔧 Configuration

Crawler Settings

  • Max Depth: crawls up to 20 levels deep
  • Max Pages: up to 10,000 pages per library
  • Word Threshold: minimum 50 words; shorter content is skipped
  • Chunk Size: 1,000 characters per chunk
  • Overlap: 200 characters between consecutive chunks
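
These settings map roughly onto crawl4ai's run configuration plus a simple overlapping chunker. A hedged sketch (parameter names follow crawl4ai's public API; the chunker is an assumption, not the project's code):

# crawl_sketch.py — illustrative mapping of the settings above
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

def chunk(text: str, size: int = 1000, overlap: int = 200):
    # 1,000-character chunks with 200 characters of overlap.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

async def crawl(url: str):
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=20, max_pages=10_000),
        word_count_threshold=50,  # skip near-empty content blocks
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url, config=config)
    return [chunk(str(r.markdown)) for r in results if r.markdown]

# asyncio.run(crawl("https://pandas.pydata.org/docs/"))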

🐛 Troubleshooting

"No documents found": Check if the library name matches the namespace in Pinecone "Metadata size exceeds limit": Content is automatically truncated "Crawl stopped early": Check the URL and site accessibility

MCP Inspector Issues

"Server connection failed":

  • Ensure the MCP server is running (python3 main.py in mcp_server/)
  • Check if port 8000 is available
  • Verify environment variables are set

"No tools available":

  • Make sure you've crawled some documentation first
  • Check the Pinecone index has data in the correct namespace (see the check script below)
  • Verify the server is properly initialized

"Search returns no results":

  • Test with a simple query like "documentation" or "help"
  • Check the library name matches your crawled data
  • Verify the Pinecone index contains the expected data
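
For the last two points, Pinecone's describe_index_stats lists the vector count per namespace, which tells you immediately whether a crawl actually landed data:

# check_index.py — verify crawled data landed in the expected namespace
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

stats = index.describe_index_stats()
for namespace, summary in stats.namespaces.items():
    print(f"{namespace}: {summary.vector_count} vectors")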

📝 Example Usage

# Crawl pandas documentation (run from docs_manager/)
python3 populate_db.py --library_name pandas --url https://pandas.pydata.org/docs/

# Crawl matplotlib documentation
python3 populate_db.py --library_name matplotlib --url https://matplotlib.org/stable/

# Search through the UI
streamlit run app.py