Crawl4AI MCP Server
A powerful Model Context Protocol (MCP) server that provides advanced web crawling capabilities with AI-enhanced content extraction, built on the Crawl4AI framework.
Features
Core Web Crawling
- Single URL Crawling: Extract content from individual web pages with intelligent parsing
- Batch Processing: Crawl multiple URLs efficiently with progress tracking
- Content Extraction: Convert web pages to clean Markdown format
- Link Discovery: Extract and filter links from web pages
- Table Extraction: Identify and extract structured data from HTML tables
Advanced Capabilities
- Magic Mode: AI-enhanced content extraction for better quality
- Intelligent Caching: Configurable caching system for performance optimization
- Browser Automation: Full browser rendering with Playwright integration
- Headless/GUI Modes: Flexible browser operation modes
- Progress Reporting: Real-time progress updates for batch operations
MCP Compliance
- FastMCP Integration: Built with modern FastMCP Python SDK
- Centralized Logging: Comprehensive logging to project root
- Health Monitoring: Detailed health checks with feature detection
- Statistics Tracking: Comprehensive crawl statistics and metrics
- Error Handling: Robust error handling with detailed logging
Available Tools
crawl_url(url, headless=True, magic=True)
Extract content from a single URL with comprehensive metadata.
Parameters:
- url (string): The URL to crawl
- headless (boolean): Run browser in headless mode (default: true)
- magic (boolean): Use AI-enhanced extraction (default: true)
Returns:
- Content length and HTML length
- Tables found with structure information
- Links count and domain information
- Output file path and timestamp
- Content excerpt (first 500 characters)
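For reference, crawl_url (like every tool here) can be invoked from any MCP client. The following is a minimal sketch using the MCP Python SDK's stdio client; the tool name and parameters match the signature above, the python server.py command follows the Debug Mode section, and the rest of the client-side setup is an assumption.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio and call the crawl_url tool.
    server = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "crawl_url",
                {"url": "https://example.com", "headless": True, "magic": True},
            )
            print(result.content)

asyncio.run(main())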
batch_crawl(urls, headless=True)
Process multiple URLs in a single batch operation.
Parameters:
- urls (array): List of URLs to crawl
- headless (boolean): Run browser in headless mode (default: true)
Returns:
- Total URLs processed
- Success/failure counts
- Individual results for each URL
- Batch output directory
- Summary JSON file
get_page_links(url, filter_domain=False)
Extract all links from a webpage with optional domain filtering.
Parameters:
- url (string): The URL to extract links from
- filter_domain (boolean): Only return links from same domain (default: false)
Returns:
- Total links found
- First 100 links (to prevent overwhelming responses; see the truncation sketch after this list)
- Domain filtering status
- Has more indicator
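The 100-link cap and has_more flag amount to a simple truncation step. The helper below is an illustrative sketch, not the server's actual code; it only mirrors the fields listed above.

MAX_LINKS = 100

def truncate_links(url, links):
    # Return at most MAX_LINKS links and flag whether more were found.
    return {
        "url": url,
        "total_links": len(links),
        "links": links[:MAX_LINKS],
        "has_more": len(links) > MAX_LINKS,
    }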
extract_tables(url)
Extract structured data from HTML tables on a webpage.
Parameters:
- url (string): The URL to extract tables from
Returns:
- Number of tables found
- Table data as 2D arrays (see the example after this list)
- Table structure information
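The returned 2D arrays are easy to post-process on the client side. A small illustrative example, using the table shape from the financial-report example further down:

# Convert a returned 2D table (header row + data rows) into dict records.
table = [["Quarter", "Revenue", "Profit"], ["Q1", "$100M", "$20M"]]
header, *rows = table
records = [dict(zip(header, row)) for row in rows]
print(records)  # [{'Quarter': 'Q1', 'Revenue': '$100M', 'Profit': '$20M'}]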
health_check()
Comprehensive server health and feature detection.
Returns:
- Server status and uptime
- Crawler connectivity test
- Optional features availability (PyTorch, Transformers, etc.)
- Crawl statistics
- Configuration information
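Optional-feature detection can be done with a simple import probe. A minimal sketch; the helper name and exact labels are assumptions chosen to match the Health Check example below.

import importlib.metadata
import importlib.util

def detect_optional_features():
    # Probe for optional ML packages without importing them fully.
    features = {}
    for label, module in [("torch_features", "torch"), ("transformer_models", "transformers")]:
        if importlib.util.find_spec(module):
            version = importlib.metadata.version(module)
            features[label] = f"installed (version {version})"
        else:
            features[label] = "not installed"
    return features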
get_readme()
Retrieve this documentation via MCP.
Example Usage
Basic Web Crawling
User: "Please crawl https://example.com and extract the main content"
Tool: crawl_url("https://example.com", headless=true, magic=true)
Result: {
  "url": "https://example.com",
  "domain": "example.com",
  "content_length": 1247,
  "tables_found": 0,
  "links_count": 15,
  "output_file": "20250709_181500_example_com.md",
  "excerpt": "Example Domain\n\nThis domain is for use in illustrative examples..."
}
Batch Processing
User: "Crawl these news sites: cnn.com, bbc.com, reuters.com"
Tool: batch_crawl(["https://cnn.com", "https://bbc.com", "https://reuters.com"])
Result: {
  "total_urls": 3,
  "successful": 3,
  "failed": 0,
  "output_directory": "./output/batch_20250709_181500",
  "results": [...]
}
Link Extraction
User: "Find all links on the Wikipedia homepage"
Tool: get_page_links("https://wikipedia.org", filter_domain=false)
Result: {
  "url": "https://wikipedia.org",
  "total_links": 247,
  "links": ["https://en.wikipedia.org", "https://es.wikipedia.org", ...],
  "has_more": true
}
Table Extraction
User: "Extract tables from this financial report"
Tool: extract_tables("https://example.com/financial-report")
Result: {
  "url": "https://example.com/financial-report",
  "tables_found": 2,
  "tables": [
    {
      "index": 0,
      "data": [["Quarter", "Revenue", "Profit"], ["Q1", "$100M", "$20M"], ...]
    }
  ]
}
Health Check
Tool: health_check()
Result: {
  "status": "healthy",
  "server_name": "Crawl4AI Server",
  "uptime_seconds": 3600,
  "connection_status": "connected",
  "crawl_statistics": {
    "total_crawls": 45,
    "successful_crawls": 43,
    "failed_crawls": 2
  },
  "optional_features": {
    "torch_features": "installed (version 2.0.1)",
    "transformer_models": "installed"
  }
}
Configuration
Environment Variables
- CRAWL4AI_DEBUG=1 - Enable debug logging
- CACHE_ENABLED=true - Enable intelligent caching (default: true)
- DEFAULT_OUTPUT_DIR=./output - Output directory for crawled content
- HEADLESS_DEFAULT=true - Default browser mode
- VIEWPORT_WIDTH=1920 - Browser viewport width
- VIEWPORT_HEIGHT=1080 - Browser viewport height
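A sketch of how these variables might be read at startup, assuming python-dotenv (a listed dependency); the exact parsing and defaults beyond those listed above are assumptions.

import os
from dotenv import load_dotenv

load_dotenv()  # pick up a local .env file if present

DEBUG = os.getenv("CRAWL4AI_DEBUG", "0") == "1"
CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
OUTPUT_DIR = os.getenv("DEFAULT_OUTPUT_DIR", "./output")
HEADLESS = os.getenv("HEADLESS_DEFAULT", "true").lower() == "true"
VIEWPORT = (int(os.getenv("VIEWPORT_WIDTH", "1920")), int(os.getenv("VIEWPORT_HEIGHT", "1080")))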
Optional Features
The server supports optional advanced features:
PyTorch Integration (~600MB-2GB):
pip install torch torchvision
Transformer Models:
pip install transformers
Cosine Similarity:
pip install numpy scipy
Installation
Automated Setup
cd servers/crawl4ai
./setup_crawl4ai_server.sh
Manual Setup
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install "mcp[cli]"
pip install "crawl4ai>=0.3.0"
pip install python-dotenv pyyaml
# Install browser dependencies
pip install playwright
playwright install chromium
# Create output directory
mkdir -p output
Testing
Basic Functionality Test
cd servers/crawl4ai
source venv/bin/activate
python3 -c "
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url='https://example.com')
        print(f'Success: {len(str(result.markdown))} characters extracted')

asyncio.run(main())
"
HTTP Bridge Test
curl -X POST "http://localhost:8000/crawl4ai/health_check" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{}'
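The same request can be sent from Python using only the standard library. This mirrors the curl example above (same bridge URL, bearer header, and empty JSON body); replace YOUR_API_KEY with your key.

import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/crawl4ai/health_check",
    data=json.dumps({}).encode(),
    headers={"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))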
Output Structure
Crawled content is organized as follows:
output/
├── 20250709_181500_example_com.md   # Single crawl
├── 20250709_182000_github_com.md    # Single crawl
└── batch_20250709_183000/           # Batch crawl
    ├── 001_cnn_com.md
    ├── 002_bbc_com.md
    ├── 003_reuters_com.md
    └── _summary.json                # Batch summary
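Filenames such as 20250709_181500_example_com.md combine a timestamp with a sanitized domain. The helper below is a hypothetical sketch of that naming scheme, not the server's actual code.

import re
from datetime import datetime
from urllib.parse import urlparse

def output_filename(url):
    # Timestamp prefix plus the domain with non-alphanumerics collapsed to underscores.
    domain = urlparse(url).netloc
    safe = re.sub(r"[^A-Za-z0-9]+", "_", domain).strip("_")
    return f"{datetime.now():%Y%m%d_%H%M%S}_{safe}.md"

print(output_filename("https://example.com"))  # e.g. 20250709_181500_example_com.md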
Security Features
- Headless Mode: Default secure browser operation
- Domain Filtering: Optional link filtering by domain
- Rate Limiting: Built-in crawler rate limiting
- Safe File Naming: Automatic sanitization of output filenames
- Error Isolation: Robust error handling prevents crashes
Troubleshooting
Common Issues
Browser Installation Failed:
# Reinstall Playwright browsers
pip install --force-reinstall playwright
playwright install chromium
Memory Issues with Large Pages:
- Reduce viewport size in configuration
- Enable headless mode
- Use selective content extraction
Timeout Errors:
- Check network connectivity
- Verify target website accessibility
- Increase timeout values in configuration
Permission Errors:
# Fix output directory permissions
chmod 755 output/
Debug Mode
Enable detailed logging:
export CRAWL4AI_DEBUG=1
python server.py
Performance
Benchmarks
- Single Page: ~2-5 seconds per page
- Batch Processing: ~3-8 seconds per page (parallel processing)
- Memory Usage: ~100-300MB base + ~50MB per concurrent crawl
- Storage: ~10-100KB per page (Markdown format)
Optimization Tips
- Use headless mode for better performance
- Enable caching for repeated crawls (see the sketch after this list)
- Batch multiple URLs for efficiency
- Filter unnecessary content types
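For the caching tip, recent crawl4ai releases expose cache control through CrawlerRunConfig and CacheMode. A minimal sketch, assuming a crawl4ai version that provides these classes; exact behavior may differ by release.

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    # Reuse cached results when the same URL is crawled repeatedly.
    run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(len(str(result.markdown)))

asyncio.run(main())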
Compliance
This server is fully compliant with MCP server standards:
- ✅ FastMCP Python SDK integration
- ✅ Synchronous function implementation
- ✅ Proper {"result": value} response format (see the sketch after this list)
- ✅ Centralized logging to project root
- ✅ Health check endpoint with all required fields
- ✅ Comprehensive error handling
- ✅ Auto-discovery integration
- ✅ Security-focused .gitignore
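As an illustration of the FastMCP integration and the {"result": value} response shape, a stripped-down tool might look like the sketch below; the real server's implementation will differ.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Crawl4AI Server")

@mcp.tool()
def health_check() -> dict:
    # Wrap the payload in the {"result": value} shape noted in the checklist above.
    return {"result": {"status": "healthy", "server_name": "Crawl4AI Server"}}

if __name__ == "__main__":
    mcp.run()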
Dependencies
Core Dependencies
- mcp[cli] - Model Context Protocol framework
- crawl4ai>=0.3.0 - Web crawling engine
- playwright - Browser automation
- python-dotenv - Environment configuration
- pyyaml - YAML configuration support
Optional Dependencies
- torch - PyTorch for advanced ML features
- transformers - Hugging Face transformers
- numpy - Numerical computing
- scipy - Scientific computing