Crawlerr - Advanced RAG-Enabled Web Crawling MCP Server
A powerful, production-ready Model Context Protocol (MCP) server that combines advanced web crawling capabilities with semantic search through RAG (Retrieval-Augmented Generation). Built with FastMCP 2.0, Crawl4AI 0.7.0, Qdrant vector database, and Qwen3-Embedding-0.6B for multilingual semantic understanding.
Features
Advanced Web Crawling
- Single Page Scraping: Extract content, metadata, images, and links from individual pages
- Site-wide Crawling: Intelligent sitemap.xml parsing with recursive fallback
- Repository Crawling: Clone and analyze GitHub repositories with code-aware processing
- Directory Crawling: Process local file systems with document format support
- Adaptive Intelligence: Smart crawling that knows when to stop based on content sufficiency
RAG-Powered Search
- Semantic Search: Vector-based similarity search using Qwen3-Embedding-0.6B
- Hybrid Search: Combines semantic and keyword-based filtering
- Metadata-Rich Results: Comprehensive source tracking and context preservation
- Query Expansion: Intelligent query enhancement for better results
Modern Architecture
- FastMCP 2.0: Streamable HTTP transport with real-time progress updates
- Qdrant Vector DB: High-performance vector storage and retrieval
- HF TEI Integration: Optimized embedding generation via Text Embeddings Inference
- Docker Compose: Containerized deployment with service orchestration
Enterprise Features
- Middleware Support: Logging, error handling, and progress tracking
- Resource Endpoints: Expose crawled data as MCP resources
- Prompt Templates: Reusable crawling and analysis templates
- Component Management: Runtime tool enabling/disabling
- Comprehensive Monitoring: Health checks, metrics, and observability
Quick Start
Prerequisites
- Docker and Docker Compose
- Python 3.9+
- 4GB+ RAM for optimal performance
1. Clone and Setup
git clone <repository-url>
cd crawlerr
2. Start Services
# Start Qdrant and HF TEI services
docker-compose up -d
# Wait for services to be healthy
docker-compose ps
3. Install Dependencies
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install -r requirements.txt
4. Run the Server
# Start the MCP server
python src/server.py
The server will be available at http://localhost:8000 with streamable HTTP transport.
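Once the server is up, you can verify it from any MCP client. A minimal sketch using the FastMCP Python client (the /mcp endpoint path is the common convention and an assumption here):

# check_server.py - connect over streamable HTTP and list the exposed tools
import asyncio
from fastmcp import Client

async def main():
    # FastMCP infers the streamable HTTP transport from the URL
    async with Client("http://localhost:8000/mcp") as client:
        tools = await client.list_tools()
        print("Available tools:", [tool.name for tool in tools])

asyncio.run(main())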
Available Tools
scrape
Single page crawling with advanced extraction capabilities.
{
"url": "https://example.com",
"extraction_strategy": "llm",
"include_images": true,
"llm_provider": "openai/gpt-4"
}
crawl
Comprehensive site crawling via sitemap or recursive strategies.
{
"url": "https://example.com",
"strategy": "sitemap",
"max_depth": 3,
"max_pages": 100,
"include_external": false
}
crawl_repo
GitHub repository cloning and analysis.
{
"repo_url": "https://github.com/user/repo",
"branch": "main",
"include_docs": true,
"file_patterns": ["*.py", "*.md"]
}
crawl_dir
Local directory processing with file type detection.
{
"directory_path": "/path/to/documents",
"recursive": true,
"file_extensions": [".pdf", ".txt", ".doc"],
"max_files": 1000
}
rag_query
Semantic search across all crawled content.
{
"query": "machine learning best practices",
"limit": 10,
"threshold": 0.7,
"source_filter": "github"
}
list_sources
Enumerate and filter all crawled sources.
{
"source_type": "webpage",
"domain": "example.com",
"date_from": "2024-01-01",
"limit": 50
}
Architecture
┌──────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│   FastMCP 2.0    │     │  Crawl4AI 0.7.0   │     │  Qdrant Vector   │
│   MCP Server     │────▶│   Web Crawler     │────▶│    Database      │
└──────────────────┘     └───────────────────┘     └──────────────────┘
         │                        │                         │
         ▼                        ▼                         ▼
┌──────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│    Streamable    │     │   HF TEI Server   │     │    Middleware    │
│  HTTP Transport  │     │ Qwen3-Embed-0.6B  │     │    & Logging     │
└──────────────────┘     └───────────────────┘     └──────────────────┘
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| MCP Framework | FastMCP 2.0 | Server framework with advanced features |
| Web Crawler | Crawl4AI 0.7.0 | AI-optimized web crawling and extraction |
| Vector DB | Qdrant | High-performance vector storage |
| Embeddings | Qwen3-Embedding-0.6B | Multilingual text embeddings |
| Inference | HF TEI | Optimized embedding generation |
| Orchestration | Docker Compose | Service deployment and management |
| Language | Python 3.9+ | Primary development language |
Configuration
Environment Variables
# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
# HF TEI Configuration
TEI_HOST=localhost
TEI_PORT=8080
# Server Configuration
MCP_HOST=127.0.0.1
MCP_PORT=8000
LOG_LEVEL=INFO
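Inside the server, these variables can be read with matching defaults. A minimal illustrative loader (a sketch of the convention, not the project's actual utils module):

# config.py - illustrative environment-variable loader with defaults
import os

QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
TEI_HOST = os.getenv("TEI_HOST", "localhost")
TEI_PORT = int(os.getenv("TEI_PORT", "8080"))
MCP_HOST = os.getenv("MCP_HOST", "127.0.0.1")
MCP_PORT = int(os.getenv("MCP_PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")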
Docker Services
The docker-compose.yml includes:
- Qdrant: Vector database on port 6333
- HF TEI: Embedding server on port 8080 with Qwen3-Embedding-0.6B
Advanced Usage
Custom Extraction Strategies
# LLM-based extraction
extraction_config = {
"strategy": "llm",
"provider": "openai/gpt-4",
"schema": {
"title": "string",
"summary": "string",
"topics": "array"
}
}
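A config like this maps onto Crawl4AI's LLM extraction strategy. The sketch below follows recent Crawl4AI releases; exact parameter names may differ in your installed version, and the OpenAI API key is assumed to be in the environment:

# llm_extraction.py - hedged sketch of schema-driven LLM extraction with Crawl4AI
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4", api_token=os.environ["OPENAI_API_KEY"]),
    schema={"title": "string", "summary": "string", "topics": "array"},
    extraction_type="schema",
    instruction="Extract the page title, a short summary, and the main topics.",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(json.loads(result.extracted_content))

asyncio.run(main())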
Deep Crawling Configuration
# BFS strategy with filtering
crawl_config = {
"strategy": "bfs",
"max_depth": 5,
"domain_filter": ["example.com", "docs.example.com"],
"content_filter": {
"min_content_length": 100,
"exclude_patterns": ["/admin", "/api"]
}
}
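Conceptually, the BFS strategy maintains a frontier of discovered links and prunes it by depth, domain, and path. A self-contained sketch of that core loop (fetch_links is a stand-in for the real page fetcher):

# bfs_crawl.py - illustrative breadth-first crawl with domain and path filtering
from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, fetch_links, max_depth=5,
              allowed_domains=("example.com", "docs.example.com"),
              exclude_patterns=("/admin", "/api")):
    """Yield URLs in BFS order; fetch_links(url) -> iterable of absolute URLs."""
    seen = {start_url}
    frontier = deque([(start_url, 0)])  # (url, depth)
    while frontier:
        url, depth = frontier.popleft()
        yield url
        if depth >= max_depth:
            continue  # do not expand past max_depth
        for link in fetch_links(url):
            parsed = urlparse(link)
            if parsed.netloc not in allowed_domains:
                continue  # domain_filter
            if any(p in parsed.path for p in exclude_patterns):
                continue  # content_filter.exclude_patterns
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))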
RAG Query Enhancement
# Hybrid search with metadata filtering
query_config = {
"query": "API documentation",
"hybrid_search": True,
"metadata_filters": {
"content_type": "documentation",
"language": "en",
"crawl_depth": {"$lte": 3}
}
}
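Metadata filters of this shape translate directly into Qdrant payload filters. A sketch using qdrant-client (the collection name is an assumption; the query vector would come from the TEI embedding step):

# hybrid_query.py - sketch of a filtered vector search with qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(host="localhost", port=6333)

# Equivalent of the metadata_filters block above
metadata_filter = Filter(must=[
    FieldCondition(key="content_type", match=MatchValue(value="documentation")),
    FieldCondition(key="language", match=MatchValue(value="en")),
    FieldCondition(key="crawl_depth", range=Range(lte=3)),  # {"$lte": 3}
])

embedding = [0.0] * 1024  # placeholder; use a real vector from the TEI /embed endpoint

hits = client.search(
    collection_name="crawlerr",  # assumed collection name
    query_vector=embedding,
    query_filter=metadata_filter,
    limit=10,
)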
Metadata Strategy
Each crawled item includes comprehensive metadata (a model sketch follows these lists):
Content Metadata
- URL, title, description, keywords
- Language detection and content type
- Extracted entities and topics
Technical Metadata
- Crawl timestamp and processing metrics
- HTTP headers and response status
- Content hash for deduplication
Quality Metrics
- Extraction confidence scores
- Content completeness percentage
- Error and retry counts
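A Pydantic model covering these fields might look like the following (a hypothetical sketch mirroring the lists above, not the project's actual schema):

# metadata.py - hypothetical Pydantic model for per-item crawl metadata
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class CrawlMetadata(BaseModel):
    # Content metadata
    url: str
    title: Optional[str] = None
    description: Optional[str] = None
    keywords: List[str] = []
    language: Optional[str] = None
    content_type: Optional[str] = None
    # Technical metadata
    crawled_at: datetime
    http_status: int
    content_hash: str  # hash of the extracted text, used for deduplication
    # Quality metrics
    extraction_confidence: float = 0.0
    completeness: float = 0.0
    retry_count: int = 0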
Development
Project Structure
src/
├── server.py      # FastMCP server entry point
├── crawlers/      # Web crawling implementations
├── rag/           # RAG and vector operations
├── middleware/    # Request/response middleware
├── models/        # Pydantic data models
└── utils/         # Configuration and utilities
Running Tests
# Unit tests
pytest tests/unit/
# Integration tests
pytest tests/integration/
# Performance tests
pytest tests/performance/
Development Setup
# Install development dependencies
pip install -r requirements-dev.txt
# Run linting
black src/ tests/
flake8 src/ tests/
# Type checking
mypy src/
Performance
Benchmarks
- Crawling Speed: 50+ pages/minute (typical web content)
- Embedding Generation: 1000+ texts/minute via HF TEI
- Search Latency: <100ms for semantic queries
- Memory Usage: ~2GB RAM for moderate workloads
Optimization Tips
- Use CSS extraction for structured, repetitive data
- Enable adaptive crawling for unknown sites
- Batch process embeddings for better throughput (see the sketch after this list)
- Configure appropriate Qdrant collection settings
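Batching amortizes HTTP overhead when embedding many chunks. A sketch against TEI's /embed endpoint, which accepts a list of inputs and returns one vector per text (the batch size here is an arbitrary assumption):

# batch_embed.py - batched embedding requests against HF TEI
import httpx

def embed_batched(texts, batch_size=32, url="http://localhost:8080/embed"):
    vectors = []
    with httpx.Client(timeout=60.0) as client:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            resp = client.post(url, json={"inputs": batch})
            resp.raise_for_status()
            vectors.extend(resp.json())  # one embedding per input text
    return vectors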
Security
- Input validation and sanitization
- Rate limiting and abuse prevention
- Secure credential management
- Network isolation via Docker
- Comprehensive audit logging
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request
Support
- Documentation: see the repository documentation for detailed technical specifications
- Issues: Report bugs and request features via GitHub Issues
- Discussions: Join community discussions for help and ideas
Acknowledgments
- FastMCP - Modern MCP server framework
- Crawl4AI - AI-optimized web crawler
- Qdrant - Vector similarity search engine
- Qwen Team - Multilingual embedding models
- Hugging Face - Text Embeddings Inference
Crawlerr - Intelligent web crawling meets powerful semantic search