Crawlerr - Advanced RAG-Enabled Web Crawling MCP Server

A powerful, production-ready Model Context Protocol (MCP) server that combines advanced web crawling capabilities with semantic search through RAG (Retrieval-Augmented Generation). Built with FastMCP 2.0, Crawl4AI 0.7.0, Qdrant vector database, and Qwen3-Embedding-0.6B for multilingual semantic understanding.

✨ Features

šŸ•·ļø Advanced Web Crawling

  • Single Page Scraping: Extract content, metadata, images, and links from individual pages
  • Site-wide Crawling: Intelligent sitemap.xml parsing with recursive fallback
  • Repository Crawling: Clone and analyze GitHub repositories with code-aware processing
  • Directory Crawling: Process local file systems with document format support
  • Adaptive Intelligence: Smart crawling that knows when to stop based on content sufficiency

🧠 RAG-Powered Search

  • Semantic Search: Vector-based similarity search using Qwen3-Embedding-0.6B
  • Hybrid Search: Combines semantic and keyword-based filtering
  • Metadata-Rich Results: Comprehensive source tracking and context preservation
  • Query Expansion: Intelligent query enhancement for better results

šŸ—ļø Modern Architecture

  • FastMCP 2.0: Streamable HTTP transport with real-time progress updates
  • Qdrant Vector DB: High-performance vector storage and retrieval
  • HF TEI Integration: Optimized embedding generation via Text Embeddings Inference
  • Docker Compose: Containerized deployment with service orchestration

šŸ› ļø Enterprise Features

  • Middleware Support: Logging, error handling, and progress tracking
  • Resource Endpoints: Expose crawled data as MCP resources
  • Prompt Templates: Reusable crawling and analysis templates
  • Component Management: Runtime tool enabling/disabling
  • Comprehensive Monitoring: Health checks, metrics, and observability

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.9+
  • 4GB+ RAM for optimal performance

1. Clone and Setup

git clone <repository-url>
cd crawlerr

2. Start Services

# Start Qdrant and HF TEI services
docker-compose up -d

# Wait for services to be healthy
docker-compose ps

3. Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install packages
pip install -r requirements.txt

4. Run the Server

# Start the MCP server
python src/server.py

The server will be available at http://localhost:8000 with streamable HTTP transport.
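
Once it is running, you can exercise the server from any MCP client. A minimal sketch using the FastMCP 2.0 Python client (the /mcp endpoint path is an assumption; adjust it if the server mounts a different path):

# verify_server.py - minimal connectivity check; assumes the FastMCP 2.0
# client package and a streamable HTTP endpoint at /mcp.
import asyncio

from fastmcp import Client

async def main():
    async with Client("http://localhost:8000/mcp") as client:
        # List the tools the server advertises (scrape, crawl, rag_query, ...)
        tools = await client.list_tools()
        print([tool.name for tool in tools])

asyncio.run(main())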

šŸ› ļø Available Tools

scrape

Single page crawling with advanced extraction capabilities.

{
  "url": "https://example.com",
  "extraction_strategy": "llm",
  "include_images": true,
  "llm_provider": "openai/gpt-4"
}

crawl

Comprehensive site crawling via sitemap or recursive strategies.

{
  "url": "https://example.com",
  "strategy": "sitemap",
  "max_depth": 3,
  "max_pages": 100,
  "include_external": false
}

crawl_repo

GitHub repository cloning and analysis.

{
  "repo_url": "https://github.com/user/repo",
  "branch": "main",
  "include_docs": true,
  "file_patterns": ["*.py", "*.md"]
}

crawl_dir

Local directory processing with file type detection.

{
  "directory_path": "/path/to/documents",
  "recursive": true,
  "file_extensions": [".pdf", ".txt", ".doc"],
  "max_files": 1000
}

rag_query

Semantic search across all crawled content.

{
  "query": "machine learning best practices",
  "limit": 10,
  "threshold": 0.7,
  "source_filter": "github"
}

list_sources

Enumerate and filter all crawled sources.

{
  "source_type": "webpage",
  "domain": "example.com",
  "date_from": "2024-01-01",
  "limit": 50
}
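
The JSON blocks above are the tools' argument payloads; from an MCP client they are passed as the arguments of a tool call. A hedged sketch, again assuming the FastMCP 2.0 client and the default endpoint path:

# call_rag_query.py - illustrative tool invocation; argument names mirror
# the rag_query example above, the endpoint path is an assumption.
import asyncio

from fastmcp import Client

async def main():
    async with Client("http://localhost:8000/mcp") as client:
        result = await client.call_tool("rag_query", {
            "query": "machine learning best practices",
            "limit": 10,
            "threshold": 0.7,
        })
        print(result)

asyncio.run(main())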

šŸ—ļø Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   FastMCP 2.0   │    │  Crawl4AI 0.7.0  │    │  Qdrant Vector  │
│   MCP Server    │◄──►│   Web Crawler    │◄──►│    Database     │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                        │                        │
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Streamable    │    │  HF TEI Server   │    │   Middleware    │
│  HTTP Transport │    │ Qwen3-Embed-0.6B │    │   & Logging     │
└─────────────────┘    └──────────────────┘    └─────────────────┘

📊 Technology Stack

Component        Technology             Purpose
MCP Framework    FastMCP 2.0            Server framework with advanced features
Web Crawler      Crawl4AI 0.7.0         AI-optimized web crawling and extraction
Vector DB        Qdrant                 High-performance vector storage
Embeddings       Qwen3-Embedding-0.6B   Multilingual text embeddings
Inference        HF TEI                 Optimized embedding generation
Orchestration    Docker Compose         Service deployment and management
Language         Python 3.9+            Primary development language

🔧 Configuration

Environment Variables

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333

# HF TEI Configuration
TEI_HOST=localhost
TEI_PORT=8080

# Server Configuration
MCP_HOST=127.0.0.1
MCP_PORT=8000
LOG_LEVEL=INFO
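
A minimal sketch of reading these variables with the standard library (the defaults mirror the values above and are assumptions, not the project's actual configuration loader):

# settings_sketch.py - loads the environment variables listed above;
# defaults are assumptions for local development.
import os

QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
TEI_HOST = os.getenv("TEI_HOST", "localhost")
TEI_PORT = int(os.getenv("TEI_PORT", "8080"))
MCP_HOST = os.getenv("MCP_HOST", "127.0.0.1")
MCP_PORT = int(os.getenv("MCP_PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")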

Docker Services

The docker-compose.yml includes:

  • Qdrant: Vector database on port 6333
  • HF TEI: Embedding server on port 8080 with Qwen3-Embedding-0.6B
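
Before starting the MCP server, a quick readiness check against both containers can look roughly like this (assumes the qdrant-client and httpx packages; hosts and ports are the defaults from the configuration above):

# check_services.py - readiness sketch for the two Docker services;
# hosts/ports match the defaults above and may differ in your deployment.
import httpx
from qdrant_client import QdrantClient

# Qdrant: listing collections succeeds once the database is up.
qdrant = QdrantClient(host="localhost", port=6333)
print("Qdrant collections:", qdrant.get_collections())

# HF TEI: /info reports the loaded model (Qwen3-Embedding-0.6B here).
info = httpx.get("http://localhost:8080/info", timeout=10).json()
print("TEI model:", info.get("model_id"))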

📚 Advanced Usage

Custom Extraction Strategies

# LLM-based extraction
extraction_config = {
    "strategy": "llm",
    "provider": "openai/gpt-4",
    "schema": {
        "title": "string",
        "summary": "string",
        "topics": "array"
    }
}
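
This configuration corresponds to Crawl4AI's LLM-based extraction strategy. A hedged sketch of the equivalent direct Crawl4AI call (class and parameter names follow the Crawl4AI documentation but may differ slightly between releases):

# llm_extraction_sketch.py - illustrative Crawl4AI usage, not Crawlerr's
# internal code; exact signatures can vary across Crawl4AI versions.
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4", api_token=os.getenv("OPENAI_API_KEY")),
        schema={"title": "string", "summary": "string", "topics": "array"},
        extraction_type="schema",
        instruction="Extract the title, a short summary, and the main topics.",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(result.extracted_content)

asyncio.run(main())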

Deep Crawling Configuration

# BFS strategy with filtering
crawl_config = {
    "strategy": "bfs",
    "max_depth": 5,
    "domain_filter": ["example.com", "docs.example.com"],
    "content_filter": {
        "min_content_length": 100,
        "exclude_patterns": ["/admin", "/api"]
    }
}

RAG Query Enhancement

# Hybrid search with metadata filtering
query_config = {
    "query": "API documentation",
    "hybrid_search": True,
    "metadata_filters": {
        "content_type": "documentation",
        "language": "en",
        "crawl_depth": {"$lte": 3}
    }
}
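
On the storage side, metadata filters of this shape translate into Qdrant payload filters combined with the query vector. A rough sketch with qdrant-client (the collection name and payload keys are assumptions, and the query vector would normally come from the embedding service):

# qdrant_filter_sketch.py - combines a vector query with payload filters;
# collection name and payload keys are illustrative, not Crawlerr's schema.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(host="localhost", port=6333)

hits = client.search(
    collection_name="crawlerr",        # assumed collection name
    query_vector=[0.0] * 1024,         # placeholder for the embedded query
    query_filter=Filter(
        must=[
            FieldCondition(key="content_type", match=MatchValue(value="documentation")),
            FieldCondition(key="language", match=MatchValue(value="en")),
            FieldCondition(key="crawl_depth", range=Range(lte=3)),
        ]
    ),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload.get("url"))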

šŸ” Metadata Strategy

Each crawled item includes comprehensive metadata:

Content Metadata

  • URL, title, description, keywords
  • Language detection and content type
  • Extracted entities and topics

Technical Metadata

  • Crawl timestamp and processing metrics
  • HTTP headers and response status
  • Content hash for deduplication

Quality Metrics

  • Extraction confidence scores
  • Content completeness percentage
  • Error and retry counts
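
Since the project keeps its data models in Pydantic (see the project structure below), this metadata can be pictured roughly as the following model; the field names are inferred from the lists above rather than copied from src/models/:

# metadata_sketch.py - illustrative Pydantic model of the metadata above;
# field names are inferred, not taken from the project's actual models.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class CrawlMetadata(BaseModel):
    # Content metadata
    url: str
    title: Optional[str] = None
    description: Optional[str] = None
    keywords: list[str] = []
    language: Optional[str] = None
    content_type: Optional[str] = None
    # Technical metadata
    crawled_at: datetime
    status_code: Optional[int] = None
    content_hash: Optional[str] = None
    # Quality metrics
    extraction_confidence: Optional[float] = None
    completeness: Optional[float] = None
    retry_count: int = 0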

🚀 Development

Project Structure

src/
├── server.py              # FastMCP server entry point
├── crawlers/              # Web crawling implementations
├── rag/                   # RAG and vector operations
├── middleware/            # Request/response middleware
├── models/                # Pydantic data models
└── utils/                 # Configuration and utilities

Running Tests

# Unit tests
pytest tests/unit/

# Integration tests
pytest tests/integration/

# Performance tests
pytest tests/performance/

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run linting
black src/ tests/
flake8 src/ tests/

# Type checking
mypy src/

📈 Performance

Benchmarks

  • Crawling Speed: 50+ pages/minute (typical web content)
  • Embedding Generation: 1000+ texts/minute via HF TEI
  • Search Latency: <100ms for semantic queries
  • Memory Usage: ~2GB RAM for moderate workloads

Optimization Tips

  • Use CSS extraction for structured, repetitive data
  • Enable adaptive crawling for unknown sites
  • Batch process embeddings for better throughput
  • Configure appropriate Qdrant collection settings
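
The batching tip, for example, amounts to sending chunks to TEI in groups rather than one request per text. A sketch against TEI's /embed endpoint (the batch size and host are assumptions):

# batch_embed_sketch.py - batches texts against HF TEI's /embed endpoint
# instead of embedding one chunk per request; batch size is an assumption.
import httpx

TEI_URL = "http://localhost:8080/embed"

def embed_batched(texts, batch_size=64):
    vectors = []
    with httpx.Client(timeout=60) as client:
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            resp = client.post(TEI_URL, json={"inputs": batch})
            resp.raise_for_status()
            vectors.extend(resp.json())  # one embedding per input text
    return vectors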

šŸ›”ļø Security

  • Input validation and sanitization
  • Rate limiting and abuse prevention
  • Secure credential management
  • Network isolation via Docker
  • Comprehensive audit logging

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

šŸ¤ Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📞 Support

  • Documentation: See the project documentation for detailed technical specifications
  • Issues: Report bugs and request features via GitHub Issues
  • Discussions: Join community discussions for help and ideas

šŸ™ Acknowledgments


Crawlerr - Intelligent web crawling meets powerful semantic search 🚀