mcp_crawl4ai

codemonkeying/mcp_crawl4ai


Crawl4AI MCP Server

A powerful Model Context Protocol (MCP) server that provides advanced web crawling capabilities with AI-enhanced content extraction, built on the Crawl4AI framework.

šŸš€ Features

Core Web Crawling

  • Single URL Crawling: Extract content from individual web pages with intelligent parsing
  • Batch Processing: Crawl multiple URLs efficiently with progress tracking
  • Content Extraction: Convert web pages to clean Markdown format
  • Link Discovery: Extract and filter links from web pages
  • Table Extraction: Identify and extract structured data from HTML tables

Advanced Capabilities

  • Magic Mode: AI-enhanced content extraction for better quality
  • Intelligent Caching: Configurable caching system for performance optimization
  • Browser Automation: Full browser rendering with Playwright integration
  • Headless/GUI Modes: Flexible browser operation modes
  • Progress Reporting: Real-time progress updates for batch operations

MCP Compliance

  • FastMCP Integration: Built with modern FastMCP Python SDK
  • Centralized Logging: Comprehensive logging to the project root (see the sketch after this list)
  • Health Monitoring: Detailed health checks with feature detection
  • Statistics Tracking: Comprehensive crawl statistics and metrics
  • Error Handling: Robust error handling with detailed logging
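
As a rough illustration of the centralized-logging item above, a server module could route Python's standard logging to a single file at the project root. The log file name and the path depth below are assumptions for illustration, not details taken from this project:

import logging
from pathlib import Path

# Assumption: this file lives at <project root>/servers/crawl4ai/server.py
PROJECT_ROOT = Path(__file__).resolve().parents[2]

logging.basicConfig(
    filename=PROJECT_ROOT / "mcp_servers.log",  # hypothetical log file name
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("crawl4ai_server")
logger.info("Crawl4AI MCP server starting")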

šŸ”§ Available Tools

crawl_url(url, headless=True, magic=True)

Extract content from a single URL with comprehensive metadata.

Parameters:

  • url (string): The URL to crawl
  • headless (boolean): Run browser in headless mode (default: true)
  • magic (boolean): Use AI-enhanced extraction (default: true)

Returns:

  • Content length and HTML length
  • Tables found with structure information
  • Links count and domain information
  • Output file path and timestamp
  • Content excerpt (first 500 characters)

batch_crawl(urls, headless=True)

Process multiple URLs in a single batch operation.

Parameters:

  • urls (array): List of URLs to crawl
  • headless (boolean): Run browser in headless mode (default: true)

Returns:

  • Total URLs processed
  • Success/failure counts
  • Individual results for each URL
  • Batch output directory
  • Summary JSON file
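
To make the batch flow concrete, here is a minimal sketch that crawls several URLs with Crawl4AI's async API and tallies success/failure counts matching the fields above. It is illustrative only; the server's actual batching, progress reporting, and file output may differ:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def batch_crawl(urls, headless=True):
    summary = {"total_urls": len(urls), "successful": 0, "failed": 0, "results": []}
    async with AsyncWebCrawler(config=BrowserConfig(headless=headless)) as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            ok = bool(getattr(result, "success", False))  # CrawlResult exposes a success flag
            summary["successful" if ok else "failed"] += 1
            summary["results"].append({"url": url, "success": ok})
    return summary

print(asyncio.run(batch_crawl(["https://example.com", "https://example.org"])))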

get_page_links(url, filter_domain=False)

Extract all links from a webpage with optional domain filtering.

Parameters:

  • url (string): The URL to extract links from
  • filter_domain (boolean): Only return links from same domain (default: false)

Returns:

  • Total links found
  • First 100 links (to prevent overwhelming responses)
  • Domain filtering status
  • has_more flag indicating whether links beyond the first 100 were omitted
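
A minimal sketch of the same-domain filtering that filter_domain describes, using only the standard library (illustrative, not the server's code):

from urllib.parse import urlparse

def filter_links(links, base_url, filter_domain=False):
    """Optionally keep only links whose host matches the base URL's host."""
    if not filter_domain:
        return links
    base_host = urlparse(base_url).netloc.lower()
    return [link for link in links if urlparse(link).netloc.lower() == base_host]

links = ["https://en.wikipedia.org/wiki/MCP", "https://example.com/about"]
print(filter_links(links, "https://en.wikipedia.org", filter_domain=True))
# ['https://en.wikipedia.org/wiki/MCP']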

extract_tables(url)

Extract structured data from HTML tables on a webpage.

Parameters:

  • url (string): The URL to extract tables from

Returns:

  • Number of tables found
  • Table data as 2D arrays
  • Table structure information
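
The 2D-array shape of the returned table data can be pictured with a small standard-library parser. The server itself extracts tables through Crawl4AI; this sketch only shows the kind of structure the tool returns:

from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect each <table> as a 2D array of cell text."""
    def __init__(self):
        super().__init__()
        self.tables = []      # one 2D array per <table>
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(" ".join(filter(None, self._cell or [])))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

parser = TableParser()
parser.feed("<table><tr><th>Quarter</th><th>Revenue</th></tr>"
            "<tr><td>Q1</td><td>$100M</td></tr></table>")
print(parser.tables)  # [[['Quarter', 'Revenue'], ['Q1', '$100M']]]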

health_check()

Comprehensive server health and feature detection.

Returns:

  • Server status and uptime
  • Crawler connectivity test
  • Optional features availability (PyTorch, Transformers, etc.)
  • Crawl statistics
  • Configuration information

get_readme()

Retrieve this documentation via MCP.

šŸ“‹ Example Usage

Basic Web Crawling

User: "Please crawl https://example.com and extract the main content"
Tool: crawl_url("https://example.com", headless=true, magic=true)
Result: {
  "url": "https://example.com",
  "domain": "example.com",
  "content_length": 1247,
  "tables_found": 0,
  "links_count": 15,
  "output_file": "20250709_181500_example_com.md",
  "excerpt": "Example Domain\n\nThis domain is for use in illustrative examples..."
}

Batch Processing

User: "Crawl these news sites: cnn.com, bbc.com, reuters.com"
Tool: batch_crawl(["https://cnn.com", "https://bbc.com", "https://reuters.com"])
Result: {
  "total_urls": 3,
  "successful": 3,
  "failed": 0,
  "output_directory": "./output/batch_20250709_181500",
  "results": [...]
}

Link Extraction

User: "Find all links on the Wikipedia homepage"
Tool: get_page_links("https://wikipedia.org", filter_domain=false)
Result: {
  "url": "https://wikipedia.org",
  "total_links": 247,
  "links": ["https://en.wikipedia.org", "https://es.wikipedia.org", ...],
  "has_more": true
}

Table Extraction

User: "Extract tables from this financial report"
Tool: extract_tables("https://example.com/financial-report")
Result: {
  "url": "https://example.com/financial-report",
  "tables_found": 2,
  "tables": [
    {
      "index": 0,
      "data": [["Quarter", "Revenue", "Profit"], ["Q1", "$100M", "$20M"], ...]
    }
  ]
}

Health Check

Tool: health_check()
Result: {
  "status": "healthy",
  "server_name": "Crawl4AI Server",
  "uptime_seconds": 3600,
  "connection_status": "connected",
  "crawl_statistics": {
    "total_crawls": 45,
    "successful_crawls": 43,
    "failed_crawls": 2
  },
  "optional_features": {
    "torch_features": "installed (version 2.0.1)",
    "transformer_models": "installed"
  }
}

āš™ļø Configuration

Environment Variables

  • CRAWL4AI_DEBUG=1 - Enable debug logging
  • CACHE_ENABLED=true - Enable intelligent caching (default: true)
  • DEFAULT_OUTPUT_DIR=./output - Output directory for crawled content
  • HEADLESS_DEFAULT=true - Default browser mode
  • VIEWPORT_WIDTH=1920 - Browser viewport width
  • VIEWPORT_HEIGHT=1080 - Browser viewport height
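
A minimal sketch of reading these variables with python-dotenv; the variable names match the list above, while the parsing and defaults are assumptions:

import os
from dotenv import load_dotenv

load_dotenv()  # pick up a local .env file if one exists

DEBUG = os.getenv("CRAWL4AI_DEBUG", "0") == "1"
CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
OUTPUT_DIR = os.getenv("DEFAULT_OUTPUT_DIR", "./output")
HEADLESS = os.getenv("HEADLESS_DEFAULT", "true").lower() == "true"
VIEWPORT_WIDTH = int(os.getenv("VIEWPORT_WIDTH", "1920"))
VIEWPORT_HEIGHT = int(os.getenv("VIEWPORT_HEIGHT", "1080"))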

Optional Features

The server supports optional advanced features:

PyTorch Integration (~600MB-2GB):

pip install torch torchvision

Transformer Models:

pip install transformers

Cosine Similarity:

pip install numpy scipy
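
The health_check tool reports whether these optional packages are present. One straightforward way to detect them looks like this (the exact strings the server reports may differ):

import importlib

def detect_optional_features():
    features = {}
    for name, module in [("torch_features", "torch"),
                         ("transformer_models", "transformers"),
                         ("cosine_similarity", "scipy")]:
        try:
            mod = importlib.import_module(module)
            version = getattr(mod, "__version__", "unknown")
            features[name] = f"installed (version {version})"
        except ImportError:
            features[name] = "not installed"
    return features

print(detect_optional_features())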

šŸ› ļø Installation

Automated Setup

cd servers/crawl4ai
./setup_crawl4ai_server.sh

Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install "mcp[cli]"
pip install "crawl4ai>=0.3.0"
pip install python-dotenv pyyaml

# Install browser dependencies
pip install playwright
playwright install chromium

# Create output directory
mkdir -p output

šŸ” Testing

Basic Functionality Test

cd servers/crawl4ai
source venv/bin/activate
python3 -c "
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Crawl a single page headlessly and report how much Markdown was extracted
    config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url='https://example.com')
        print(f'Success: {len(str(result.markdown))} characters extracted')

asyncio.run(main())
"

HTTP Bridge Test

curl -X POST "http://localhost:8000/crawl4ai/health_check" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'
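
The same request from Python, using only the standard library and the same placeholder endpoint and bearer token as the curl example:

import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/crawl4ai/health_check",
    data=json.dumps({}).encode(),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))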

šŸ“ Output Structure

Crawled content is organized as follows:

output/
ā”œā”€ā”€ 20250709_181500_example_com.md          # Single crawl
ā”œā”€ā”€ 20250709_182000_github_com.md           # Single crawl
└── batch_20250709_183000/                  # Batch crawl
    ā”œā”€ā”€ 001_cnn_com.md
    ā”œā”€ā”€ 002_bbc_com.md
    ā”œā”€ā”€ 003_reuters_com.md
    └── _summary.json                       # Batch summary
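
The timestamped names above (for example 20250709_181500_example_com.md) can be reproduced with a small helper like the following; the server's actual naming code may differ in detail:

import re
from datetime import datetime
from urllib.parse import urlparse

def output_filename(url):
    """Build a sanitized, timestamped Markdown filename for a crawled URL."""
    domain = urlparse(url).netloc
    safe = re.sub(r"[^A-Za-z0-9]+", "_", domain).strip("_")
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{safe}.md"

print(output_filename("https://example.com"))  # e.g. 20250709_181500_example_com.md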

šŸ”’ Security Features

  • Headless Mode: Browsers run headless by default for safer operation
  • Domain Filtering: Optional link filtering by domain
  • Rate Limiting: Built-in crawler rate limiting
  • Safe File Naming: Automatic sanitization of output filenames
  • Error Isolation: Robust error handling prevents crashes

🚨 Troubleshooting

Common Issues

Browser Installation Failed:

# Reinstall Playwright browsers
pip install --force-reinstall playwright
playwright install chromium

Memory Issues with Large Pages:

  • Reduce viewport size in configuration
  • Enable headless mode
  • Use selective content extraction

Timeout Errors:

  • Check network connectivity
  • Verify target website accessibility
  • Increase timeout values in configuration

Permission Errors:

# Fix output directory permissions
chmod 755 output/

Debug Mode

Enable detailed logging:

export CRAWL4AI_DEBUG=1
python server.py

šŸ“Š Performance

Benchmarks

  • Single Page: ~2-5 seconds per page
  • Batch Processing: ~3-8 seconds per page (parallel processing)
  • Memory Usage: ~100-300MB base + ~50MB per concurrent crawl
  • Storage: ~10-100KB per page (Markdown format)

Optimization Tips

  • Use headless mode for better performance
  • Enable caching for repeated crawls
  • Batch multiple URLs for efficiency
  • Filter unnecessary content types

šŸ·ļø Compliance

This server is fully compliant with MCP server standards:

  • āœ… FastMCP Python SDK integration
  • āœ… Synchronous function implementation
  • āœ… Proper {"result": value} response format
  • āœ… Centralized logging to project root
  • āœ… Health check endpoint with all required fields
  • āœ… Comprehensive error handling
  • āœ… Auto-discovery integration
  • āœ… Security-focused .gitignore
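
For orientation, here is a minimal FastMCP server in the spirit of this checklist: a synchronous tool whose payload is wrapped in {"result": value}. The tool body is illustrative only, not the server's actual implementation:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Crawl4AI Server")

@mcp.tool()
def health_check() -> dict:
    """Report basic server health in the {"result": value} format."""
    return {"result": {"status": "healthy", "server_name": "Crawl4AI Server"}}

if __name__ == "__main__":
    mcp.run()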

šŸ“š Dependencies

Core Dependencies

  • mcp[cli] - Model Context Protocol framework
  • crawl4ai>=0.3.0 - Web crawling engine
  • playwright - Browser automation
  • python-dotenv - Environment configuration
  • pyyaml - YAML configuration support

Optional Dependencies

  • torch - PyTorch for advanced ML features
  • transformers - Hugging Face transformers
  • numpy - Numerical computing
  • scipy - Scientific computing