Crawlerr - Advanced RAG-Enabled Web Crawling MCP Server

A powerful, production-ready Model Context Protocol (MCP) server that combines advanced web crawling capabilities with semantic search through RAG (Retrieval-Augmented Generation). Built with FastMCP 2.0, Crawl4AI 0.7.0, Qdrant vector database, and Qwen3-Embedding-0.6B for multilingual semantic understanding.

✨ Features

šŸ•·ļø Advanced Web Crawling

  • Single Page Scraping: Extract content, metadata, images, and links from individual pages
  • Site-wide Crawling: Intelligent sitemap.xml parsing with recursive fallback
  • Repository Crawling: Clone and analyze GitHub repositories with code-aware processing
  • Directory Crawling: Process local file systems with document format support
  • Adaptive Intelligence: Smart crawling that knows when to stop based on content sufficiency

🧠 RAG-Powered Search

  • Semantic Search: Vector-based similarity search using Qwen3-Embedding-0.6B
  • Hybrid Search: Combines semantic and keyword-based filtering
  • Metadata-Rich Results: Comprehensive source tracking and context preservation
  • Query Expansion: Intelligent query enhancement for better results

šŸ—ļø Modern Architecture

  • FastMCP 2.0: Streamable HTTP transport with real-time progress updates
  • Qdrant Vector DB: High-performance vector storage and retrieval
  • HF TEI Integration: Optimized embedding generation via Text Embeddings Inference
  • Docker Compose: Containerized deployment with service orchestration

šŸ› ļø Enterprise Features

  • Middleware Support: Logging, error handling, and progress tracking
  • Resource Endpoints: Expose crawled data as MCP resources
  • Prompt Templates: Reusable crawling and analysis templates
  • Component Management: Runtime tool enabling/disabling
  • Comprehensive Monitoring: Health checks, metrics, and observability

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.9+
  • 4GB+ RAM for optimal performance

1. Clone and Setup

git clone <repository-url>
cd crawlerr

2. Start Services

# Start Qdrant and HF TEI services
docker-compose up -d

# Wait for services to be healthy
docker-compose ps

3. Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install packages
pip install -r requirements.txt

4. Run the Server

# Start the MCP server
python src/server.py

The server will be available at http://localhost:8000 with streamable HTTP transport.
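
Once it is running, you can exercise the server from any MCP client. A minimal sketch using the FastMCP 2.0 Python client (the /mcp endpoint path is an assumption; adjust it if the server mounts a different path):

# verify_server.py - minimal connectivity check; assumes the FastMCP 2.0
# client package and a streamable HTTP endpoint at /mcp.
import asyncio

from fastmcp import Client

async def main():
    async with Client("http://localhost:8000/mcp") as client:
        # List the tools the server advertises (scrape, crawl, rag_query, ...)
        tools = await client.list_tools()
        print([tool.name for tool in tools])

asyncio.run(main())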

šŸ› ļø Available Tools

scrape

Single page crawling with advanced extraction capabilities.

{
  "url": "https://example.com",
  "extraction_strategy": "llm",
  "include_images": true,
  "llm_provider": "openai/gpt-4"
}

crawl

Comprehensive site crawling via sitemap or recursive strategies.

{
  "url": "https://example.com",
  "strategy": "sitemap",
  "max_depth": 3,
  "max_pages": 100,
  "include_external": false
}

crawl_repo

GitHub repository cloning and analysis.

{
  "repo_url": "https://github.com/user/repo",
  "branch": "main",
  "include_docs": true,
  "file_patterns": ["*.py", "*.md"]
}

crawl_dir

Local directory processing with file type detection.

{
  "directory_path": "/path/to/documents",
  "recursive": true,
  "file_extensions": [".pdf", ".txt", ".doc"],
  "max_files": 1000
}

rag_query

Semantic search across all crawled content.

{
  "query": "machine learning best practices",
  "limit": 10,
  "threshold": 0.7,
  "source_filter": "github"
}

list_sources

Enumerate and filter all crawled sources.

{
  "source_type": "webpage",
  "domain": "example.com",
  "date_from": "2024-01-01",
  "limit": 50
}
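
The JSON blocks above are the tools' argument payloads; from an MCP client they are passed as the arguments of a tool call. A hedged sketch, again assuming the FastMCP 2.0 client and the default endpoint path:

# call_rag_query.py - illustrative tool invocation; argument names mirror
# the rag_query example above, the endpoint path is an assumption.
import asyncio

from fastmcp import Client

async def main():
    async with Client("http://localhost:8000/mcp") as client:
        result = await client.call_tool("rag_query", {
            "query": "machine learning best practices",
            "limit": 10,
            "threshold": 0.7,
        })
        print(result)

asyncio.run(main())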

šŸ—ļø Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   FastMCP 2.0   │    │  Crawl4AI 0.7.0  │    │  Qdrant Vector  │
│   MCP Server    │◄──►│   Web Crawler    │◄──►│    Database     │
│                 │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                        │                        │
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Streamable    │    │  HF TEI Server   │    │   Middleware    │
│  HTTP Transport │    │ Qwen3-Embed-0.6B │    │   & Logging     │
└─────────────────┘    └──────────────────┘    └─────────────────┘

📊 Technology Stack

Component        Technology             Purpose
MCP Framework    FastMCP 2.0            Server framework with advanced features
Web Crawler      Crawl4AI 0.7.0         AI-optimized web crawling and extraction
Vector DB        Qdrant                 High-performance vector storage
Embeddings       Qwen3-Embedding-0.6B   Multilingual text embeddings
Inference        HF TEI                 Optimized embedding generation
Orchestration    Docker Compose         Service deployment and management
Language         Python 3.9+            Primary development language

🔧 Configuration

Environment Variables

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333

# HF TEI Configuration
TEI_HOST=localhost
TEI_PORT=8080

# Server Configuration
MCP_HOST=127.0.0.1
MCP_PORT=8000
LOG_LEVEL=INFO
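
A minimal sketch of reading these variables with the standard library (the defaults mirror the values above and are assumptions, not the project's actual configuration loader):

# settings_sketch.py - loads the environment variables listed above;
# defaults are assumptions for local development.
import os

QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
TEI_HOST = os.getenv("TEI_HOST", "localhost")
TEI_PORT = int(os.getenv("TEI_PORT", "8080"))
MCP_HOST = os.getenv("MCP_HOST", "127.0.0.1")
MCP_PORT = int(os.getenv("MCP_PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")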

Docker Services

The docker-compose.yml includes:

  • Qdrant: Vector database on port 6333
  • HF TEI: Embedding server on port 8080 with Qwen3-Embedding-0.6B
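
Before starting the MCP server, a quick readiness check against both containers can look roughly like this (assumes the qdrant-client and httpx packages; hosts and ports are the defaults from the configuration above):

# check_services.py - readiness sketch for the two Docker services;
# hosts/ports match the defaults above and may differ in your deployment.
import httpx
from qdrant_client import QdrantClient

# Qdrant: listing collections succeeds once the database is up.
qdrant = QdrantClient(host="localhost", port=6333)
print("Qdrant collections:", qdrant.get_collections())

# HF TEI: /info reports the loaded model (Qwen3-Embedding-0.6B here).
info = httpx.get("http://localhost:8080/info", timeout=10).json()
print("TEI model:", info.get("model_id"))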

📚 Advanced Usage

Custom Extraction Strategies

# LLM-based extraction
extraction_config = {
    "strategy": "llm",
    "provider": "openai/gpt-4",
    "schema": {
        "title": "string",
        "summary": "string",
        "topics": "array"
    }
}
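
This configuration corresponds to Crawl4AI's LLM-based extraction strategy. A hedged sketch of the equivalent direct Crawl4AI call (class and parameter names follow the Crawl4AI documentation but may differ slightly between releases):

# llm_extraction_sketch.py - illustrative Crawl4AI usage, not Crawlerr's
# internal code; exact signatures can vary across Crawl4AI versions.
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4", api_token=os.getenv("OPENAI_API_KEY")),
        schema={"title": "string", "summary": "string", "topics": "array"},
        extraction_type="schema",
        instruction="Extract the title, a short summary, and the main topics.",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(result.extracted_content)

asyncio.run(main())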

Deep Crawling Configuration

# BFS strategy with filtering
crawl_config = {
    "strategy": "bfs",
    "max_depth": 5,
    "domain_filter": ["example.com", "docs.example.com"],
    "content_filter": {
        "min_content_length": 100,
        "exclude_patterns": ["/admin", "/api"]
    }
}

RAG Query Enhancement

# Hybrid search with metadata filtering
query_config = {
    "query": "API documentation",
    "hybrid_search": True,
    "metadata_filters": {
        "content_type": "documentation",
        "language": "en",
        "crawl_depth": {"$lte": 3}
    }
}
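
On the storage side, metadata filters of this shape translate into Qdrant payload filters combined with the query vector. A rough sketch with qdrant-client (the collection name and payload keys are assumptions, and the query vector would normally come from the embedding service):

# qdrant_filter_sketch.py - combines a vector query with payload filters;
# collection name and payload keys are illustrative, not Crawlerr's schema.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(host="localhost", port=6333)

hits = client.search(
    collection_name="crawlerr",        # assumed collection name
    query_vector=[0.0] * 1024,         # placeholder for the embedded query
    query_filter=Filter(
        must=[
            FieldCondition(key="content_type", match=MatchValue(value="documentation")),
            FieldCondition(key="language", match=MatchValue(value="en")),
            FieldCondition(key="crawl_depth", range=Range(lte=3)),
        ]
    ),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload.get("url"))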

šŸ” Metadata Strategy

Each crawled item includes comprehensive metadata:

Content Metadata

  • URL, title, description, keywords
  • Language detection and content type
  • Extracted entities and topics

Technical Metadata

  • Crawl timestamp and processing metrics
  • HTTP headers and response status
  • Content hash for deduplication

Quality Metrics

  • Extraction confidence scores
  • Content completeness percentage
  • Error and retry counts
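
Since the project keeps its data models in Pydantic (see the project structure below), this metadata can be pictured roughly as the following model; the field names are inferred from the lists above rather than copied from src/models/:

# metadata_sketch.py - illustrative Pydantic model of the metadata above;
# field names are inferred, not taken from the project's actual models.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class CrawlMetadata(BaseModel):
    # Content metadata
    url: str
    title: Optional[str] = None
    description: Optional[str] = None
    keywords: list[str] = []
    language: Optional[str] = None
    content_type: Optional[str] = None
    # Technical metadata
    crawled_at: datetime
    status_code: Optional[int] = None
    content_hash: Optional[str] = None
    # Quality metrics
    extraction_confidence: Optional[float] = None
    completeness: Optional[float] = None
    retry_count: int = 0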

🚀 Development

Project Structure

src/
├── server.py              # FastMCP server entry point
├── crawlers/              # Web crawling implementations
├── rag/                   # RAG and vector operations
├── middleware/            # Request/response middleware
├── models/                # Pydantic data models
└── utils/                 # Configuration and utilities

Running Tests

# Unit tests
pytest tests/unit/

# Integration tests
pytest tests/integration/

# Performance tests
pytest tests/performance/

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run linting
black src/ tests/
flake8 src/ tests/

# Type checking
mypy src/

📈 Performance

Benchmarks

  • Crawling Speed: 50+ pages/minute (typical web content)
  • Embedding Generation: 1000+ texts/minute via HF TEI
  • Search Latency: <100ms for semantic queries
  • Memory Usage: ~2GB RAM for moderate workloads

Optimization Tips

  • Use CSS extraction for structured, repetitive data
  • Enable adaptive crawling for unknown sites
  • Batch process embeddings for better throughput
  • Configure appropriate Qdrant collection settings
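
The batching tip, for example, amounts to sending chunks to TEI in groups rather than one request per text. A sketch against TEI's /embed endpoint (the batch size and host are assumptions):

# batch_embed_sketch.py - batches texts against HF TEI's /embed endpoint
# instead of embedding one chunk per request; batch size is an assumption.
import httpx

TEI_URL = "http://localhost:8080/embed"

def embed_batched(texts, batch_size=64):
    vectors = []
    with httpx.Client(timeout=60) as client:
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            resp = client.post(TEI_URL, json={"inputs": batch})
            resp.raise_for_status()
            vectors.extend(resp.json())  # one embedding per input text
    return vectors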

šŸ›”ļø Security

  • Input validation and sanitization
  • Rate limiting and abuse prevention
  • Secure credential management
  • Network isolation via Docker
  • Comprehensive audit logging

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

šŸ¤ Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📞 Support

  • Documentation: See the project documentation for detailed technical specifications
  • Issues: Report bugs and request features via GitHub Issues
  • Discussions: Join community discussions for help and ideas

šŸ™ Acknowledgments


Crawlerr - Intelligent web crawling meets powerful semantic search 🚀