mcp_crawl4ai

codemonkeying/mcp_crawl4ai


Crawl4AI MCP Server

A powerful Model Context Protocol (MCP) server that provides advanced web crawling capabilities with AI-enhanced content extraction, built on the Crawl4AI framework.

šŸš€ Features

Core Web Crawling

  • Single URL Crawling: Extract content from individual web pages with intelligent parsing
  • Batch Processing: Crawl multiple URLs efficiently with progress tracking
  • Content Extraction: Convert web pages to clean Markdown format
  • Link Discovery: Extract and filter links from web pages
  • Table Extraction: Identify and extract structured data from HTML tables

Advanced Capabilities

  • Magic Mode: AI-enhanced content extraction for better quality
  • Intelligent Caching: Configurable caching system for performance optimization
  • Browser Automation: Full browser rendering with Playwright integration
  • Headless/GUI Modes: Flexible browser operation modes
  • Progress Reporting: Real-time progress updates for batch operations

MCP Compliance

  • FastMCP Integration: Built with modern FastMCP Python SDK
  • Centralized Logging: Comprehensive logging to the project root (see the sketch after this list)
  • Health Monitoring: Detailed health checks with feature detection
  • Statistics Tracking: Comprehensive crawl statistics and metrics
  • Error Handling: Robust error handling with detailed logging
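
As a rough illustration of the centralized-logging item above, a server module could route Python's standard logging to a single file at the project root. The log file name and the path depth below are assumptions for illustration, not details taken from this project:

import logging
from pathlib import Path

# Assumption: this file lives at <project root>/servers/crawl4ai/server.py
PROJECT_ROOT = Path(__file__).resolve().parents[2]

logging.basicConfig(
    filename=PROJECT_ROOT / "mcp_servers.log",  # hypothetical log file name
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("crawl4ai_server")
logger.info("Crawl4AI MCP server starting")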

šŸ”§ Available Tools

crawl_url(url, headless=True, magic=True)

Extract content from a single URL with comprehensive metadata.

Parameters:

  • url (string): The URL to crawl
  • headless (boolean): Run browser in headless mode (default: true)
  • magic (boolean): Use AI-enhanced extraction (default: true)

Returns:

  • Content length and HTML length
  • Tables found with structure information
  • Links count and domain information
  • Output file path and timestamp
  • Content excerpt (first 500 characters)

batch_crawl(urls, headless=True)

Process multiple URLs in a single batch operation.

Parameters:

  • urls (array): List of URLs to crawl
  • headless (boolean): Run browser in headless mode (default: true)

Returns:

  • Total URLs processed
  • Success/failure counts
  • Individual results for each URL
  • Batch output directory
  • Summary JSON file
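
To make the batch flow concrete, here is a minimal sketch that crawls several URLs with Crawl4AI's async API and tallies success/failure counts matching the fields above. It is illustrative only; the server's actual batching, progress reporting, and file output may differ:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def batch_crawl(urls, headless=True):
    summary = {"total_urls": len(urls), "successful": 0, "failed": 0, "results": []}
    async with AsyncWebCrawler(config=BrowserConfig(headless=headless)) as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            ok = bool(getattr(result, "success", False))  # CrawlResult exposes a success flag
            summary["successful" if ok else "failed"] += 1
            summary["results"].append({"url": url, "success": ok})
    return summary

print(asyncio.run(batch_crawl(["https://example.com", "https://example.org"])))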

get_page_links(url, filter_domain=False)

Extract all links from a webpage with optional domain filtering.

Parameters:

  • url (string): The URL to extract links from
  • filter_domain (boolean): Only return links from same domain (default: false)

Returns:

  • Total links found
  • First 100 links (to prevent overwhelming responses)
  • Domain filtering status
  • has_more flag indicating whether links beyond the first 100 were omitted
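
A minimal sketch of the same-domain filtering that filter_domain describes, using only the standard library (illustrative, not the server's code):

from urllib.parse import urlparse

def filter_links(links, base_url, filter_domain=False):
    """Optionally keep only links whose host matches the base URL's host."""
    if not filter_domain:
        return links
    base_host = urlparse(base_url).netloc.lower()
    return [link for link in links if urlparse(link).netloc.lower() == base_host]

links = ["https://en.wikipedia.org/wiki/MCP", "https://example.com/about"]
print(filter_links(links, "https://en.wikipedia.org", filter_domain=True))
# ['https://en.wikipedia.org/wiki/MCP']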

extract_tables(url)

Extract structured data from HTML tables on a webpage.

Parameters:

  • url (string): The URL to extract tables from

Returns:

  • Number of tables found
  • Table data as 2D arrays
  • Table structure information
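
The 2D-array shape of the returned table data can be pictured with a small standard-library parser. The server itself extracts tables through Crawl4AI; this sketch only shows the kind of structure the tool returns:

from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect each <table> as a 2D array of cell text."""
    def __init__(self):
        super().__init__()
        self.tables = []      # one 2D array per <table>
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(" ".join(filter(None, self._cell or [])))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

parser = TableParser()
parser.feed("<table><tr><th>Quarter</th><th>Revenue</th></tr>"
            "<tr><td>Q1</td><td>$100M</td></tr></table>")
print(parser.tables)  # [[['Quarter', 'Revenue'], ['Q1', '$100M']]]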

health_check()

Comprehensive server health and feature detection.

Returns:

  • Server status and uptime
  • Crawler connectivity test
  • Optional features availability (PyTorch, Transformers, etc.)
  • Crawl statistics
  • Configuration information

get_readme()

Retrieve this documentation via MCP.

šŸ“‹ Example Usage

Basic Web Crawling

User: "Please crawl https://example.com and extract the main content"
Tool: crawl_url("https://example.com", headless=true, magic=true)
Result: {
  "url": "https://example.com",
  "domain": "example.com",
  "content_length": 1247,
  "tables_found": 0,
  "links_count": 15,
  "output_file": "20250709_181500_example_com.md",
  "excerpt": "Example Domain\n\nThis domain is for use in illustrative examples..."
}

Batch Processing

User: "Crawl these news sites: cnn.com, bbc.com, reuters.com"
Tool: batch_crawl(["https://cnn.com", "https://bbc.com", "https://reuters.com"])
Result: {
  "total_urls": 3,
  "successful": 3,
  "failed": 0,
  "output_directory": "./output/batch_20250709_181500",
  "results": [...]
}

Link Extraction

User: "Find all links on the Wikipedia homepage"
Tool: get_page_links("https://wikipedia.org", filter_domain=false)
Result: {
  "url": "https://wikipedia.org",
  "total_links": 247,
  "links": ["https://en.wikipedia.org", "https://es.wikipedia.org", ...],
  "has_more": true
}

Table Extraction

User: "Extract tables from this financial report"
Tool: extract_tables("https://example.com/financial-report")
Result: {
  "url": "https://example.com/financial-report",
  "tables_found": 2,
  "tables": [
    {
      "index": 0,
      "data": [["Quarter", "Revenue", "Profit"], ["Q1", "$100M", "$20M"], ...]
    }
  ]
}

Health Check

Tool: health_check()
Result: {
  "status": "healthy",
  "server_name": "Crawl4AI Server",
  "uptime_seconds": 3600,
  "connection_status": "connected",
  "crawl_statistics": {
    "total_crawls": 45,
    "successful_crawls": 43,
    "failed_crawls": 2
  },
  "optional_features": {
    "torch_features": "installed (version 2.0.1)",
    "transformer_models": "installed"
  }
}

āš™ļø Configuration

Environment Variables

  • CRAWL4AI_DEBUG=1 - Enable debug logging
  • CACHE_ENABLED=true - Enable intelligent caching (default: true)
  • DEFAULT_OUTPUT_DIR=./output - Output directory for crawled content
  • HEADLESS_DEFAULT=true - Default browser mode
  • VIEWPORT_WIDTH=1920 - Browser viewport width
  • VIEWPORT_HEIGHT=1080 - Browser viewport height
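
A minimal sketch of reading these variables with python-dotenv; the variable names match the list above, while the parsing and defaults are assumptions:

import os
from dotenv import load_dotenv

load_dotenv()  # pick up a local .env file if one exists

DEBUG = os.getenv("CRAWL4AI_DEBUG", "0") == "1"
CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
OUTPUT_DIR = os.getenv("DEFAULT_OUTPUT_DIR", "./output")
HEADLESS = os.getenv("HEADLESS_DEFAULT", "true").lower() == "true"
VIEWPORT_WIDTH = int(os.getenv("VIEWPORT_WIDTH", "1920"))
VIEWPORT_HEIGHT = int(os.getenv("VIEWPORT_HEIGHT", "1080"))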

Optional Features

The server supports optional advanced features:

PyTorch Integration (~600MB-2GB):

pip install torch torchvision

Transformer Models:

pip install transformers

Cosine Similarity:

pip install numpy scipy
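
The health_check tool reports whether these optional packages are present. One straightforward way to detect them looks like this (the exact strings the server reports may differ):

import importlib

def detect_optional_features():
    features = {}
    for name, module in [("torch_features", "torch"),
                         ("transformer_models", "transformers"),
                         ("cosine_similarity", "scipy")]:
        try:
            mod = importlib.import_module(module)
            version = getattr(mod, "__version__", "unknown")
            features[name] = f"installed (version {version})"
        except ImportError:
            features[name] = "not installed"
    return features

print(detect_optional_features())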

šŸ› ļø Installation

Automated Setup

cd servers/crawl4ai
./setup_crawl4ai_server.sh

Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install "mcp[cli]"
pip install "crawl4ai>=0.3.0"
pip install python-dotenv pyyaml

# Install browser dependencies
pip install playwright
playwright install chromium

# Create output directory
mkdir -p output

šŸ” Testing

Basic Functionality Test

cd servers/crawl4ai
source venv/bin/activate
python3 -c "
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Crawl a single page headlessly and report how much Markdown was extracted
    config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url='https://example.com')
        print(f'Success: {len(str(result.markdown))} characters extracted')

asyncio.run(main())
"

HTTP Bridge Test

curl -X POST "http://localhost:8000/crawl4ai/health_check" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'
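
The same request from Python, using only the standard library and the same placeholder endpoint and bearer token as the curl example:

import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/crawl4ai/health_check",
    data=json.dumps({}).encode(),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))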

šŸ“ Output Structure

Crawled content is organized as follows:

output/
ā”œā”€ā”€ 20250709_181500_example_com.md          # Single crawl
ā”œā”€ā”€ 20250709_182000_github_com.md           # Single crawl
└── batch_20250709_183000/                  # Batch crawl
    ā”œā”€ā”€ 001_cnn_com.md
    ā”œā”€ā”€ 002_bbc_com.md
    ā”œā”€ā”€ 003_reuters_com.md
    └── _summary.json                       # Batch summary
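
The timestamped names above (for example 20250709_181500_example_com.md) can be reproduced with a small helper like the following; the server's actual naming code may differ in detail:

import re
from datetime import datetime
from urllib.parse import urlparse

def output_filename(url):
    """Build a sanitized, timestamped Markdown filename for a crawled URL."""
    domain = urlparse(url).netloc
    safe = re.sub(r"[^A-Za-z0-9]+", "_", domain).strip("_")
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{safe}.md"

print(output_filename("https://example.com"))  # e.g. 20250709_181500_example_com.md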

šŸ”’ Security Features

  • Headless Mode: Browsers run headless by default for safer operation
  • Domain Filtering: Optional link filtering by domain
  • Rate Limiting: Built-in crawler rate limiting
  • Safe File Naming: Automatic sanitization of output filenames
  • Error Isolation: Robust error handling prevents crashes

🚨 Troubleshooting

Common Issues

Browser Installation Failed:

# Reinstall Playwright browsers
pip install --force-reinstall playwright
playwright install chromium

Memory Issues with Large Pages:

  • Reduce viewport size in configuration
  • Enable headless mode
  • Use selective content extraction

Timeout Errors:

  • Check network connectivity
  • Verify target website accessibility
  • Increase timeout values in configuration

Permission Errors:

# Fix output directory permissions
chmod 755 output/

Debug Mode

Enable detailed logging:

export CRAWL4AI_DEBUG=1
python server.py

šŸ“Š Performance

Benchmarks

  • Single Page: ~2-5 seconds per page
  • Batch Processing: ~3-8 seconds per page (parallel processing)
  • Memory Usage: ~100-300MB base + ~50MB per concurrent crawl
  • Storage: ~10-100KB per page (Markdown format)

Optimization Tips

  • Use headless mode for better performance
  • Enable caching for repeated crawls
  • Batch multiple URLs for efficiency
  • Filter unnecessary content types

šŸ·ļø Compliance

This server is fully compliant with MCP server standards:

  • āœ… FastMCP Python SDK integration
  • āœ… Synchronous function implementation
  • āœ… Proper {"result": value} response format
  • āœ… Centralized logging to project root
  • āœ… Health check endpoint with all required fields
  • āœ… Comprehensive error handling
  • āœ… Auto-discovery integration
  • āœ… Security-focused .gitignore
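
For orientation, here is a minimal FastMCP server in the spirit of this checklist: a synchronous tool whose payload is wrapped in {"result": value}. The tool body is illustrative only, not the server's actual implementation:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Crawl4AI Server")

@mcp.tool()
def health_check() -> dict:
    """Report basic server health in the {"result": value} format."""
    return {"result": {"status": "healthy", "server_name": "Crawl4AI Server"}}

if __name__ == "__main__":
    mcp.run()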

šŸ“š Dependencies

Core Dependencies

  • mcp[cli] - Model Context Protocol framework
  • crawl4ai>=0.3.0 - Web crawling engine
  • playwright - Browser automation
  • python-dotenv - Environment configuration
  • pyyaml - YAML configuration support

Optional Dependencies

  • torch - PyTorch for advanced ML features
  • transformers - Hugging Face transformers
  • numpy - Numerical computing
  • scipy - Scientific computing