Crawl-MCP: Unofficial MCP Server for crawl4ai
⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.
🌟 Key Features
- 🔍 Google Search Integration: 7 optimized search genres using official Google search operators
- 🔍 Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
- 🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- 🤖 AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
- 🎬 YouTube Integration: Extract video transcripts and summaries without API keys
- ⚡ Production Ready: 21 specialized tools with comprehensive error handling
🚀 Quick Start
Prerequisites (Required First)
Install system dependencies for Playwright:
Ubuntu 24.04 LTS (manual setup required):
# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
libxcomposite1 libxcursor1 libxdamage1 libxi6 \
fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.54.0 && playwright install chromium
sudo playwright install-deps
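To confirm that the browser install succeeded, you can run a quick check from the activated virtual environment. This is a minimal sketch using Playwright's standard Python API and is not part of the project's setup scripts:
# Verify that Playwright can launch headless Chromium; prints the browser version on success
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); b = p.chromium.launch(); print(b.version); b.close(); p.stop()"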
Other Linux/macOS:
sudo bash scripts/prepare_for_uvx_playwright.sh
Windows (as Administrator):
scripts/prepare_for_uvx_playwright.ps1
Installation
UVX (Recommended - Easiest):
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
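uvx accepts the standard uv Git-reference syntax, so a specific branch, tag, or commit can be pinned by appending it to the URL; the @main ref below is only an illustration of the syntax:
# Optionally pin the install to a specific ref instead of the default branch
uvx --from "git+https://github.com/walksoda/crawl-mcp@main" crawl-mcp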
Docker (Production-Ready):
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp
# Build and run with Docker Compose (STDIO mode)
docker-compose up --build
# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http
# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
Docker Features:
- 🔧 Multi-Browser Support: Chromium, Firefox, and WebKit headless browsers
- 🐧 Google Chrome: Chrome Stable is also installed for compatibility
- ⚡ Optimized Performance: Pre-configured browser flags for Docker
- 🔒 Security: Runs as a non-root user
- 📦 Complete Dependencies: All required libraries included
Claude Desktop Setup
UVX Installation:
Add the following to your `claude_desktop_config.json`:
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "en"
      }
    }
  }
}
Docker HTTP Mode:
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "http",
      "baseUrl": "http://localhost:8000"
    }
  }
}
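Before pointing Claude Desktop at the HTTP endpoint, it can help to confirm that the container is actually reachable. The request below only checks that something is listening on port 8000; the exact API routes are covered in the HTTP integration documentation, so a non-200 response here still confirms connectivity:
# Confirm the HTTP container is listening on port 8000
curl -i http://localhost:8000/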
For a Japanese interface, set:
"env": {
  "CRAWL4AI_LANG": "ja"
}
📖 Documentation
Topic | Description |
---|---|
Installation | Complete installation instructions for all platforms |
Tool Reference | Full tool documentation and usage examples |
Configuration Examples | Platform-specific setup configurations |
HTTP Integration | HTTP API access and integration methods |
Advanced Usage | Power user techniques and workflows |
Development | Contributing and development setup |
Language-Specific Documentation
- English: English documentation directory
- Japanese (日本語): Japanese documentation directory
🛠️ Tool Overview
Web Crawling
- `crawl_url` - Single page crawling with JavaScript support (example call below)
- `deep_crawl_site` - Multi-page site mapping and exploration
- `crawl_url_with_fallback` - Robust crawling with retry strategies
- `batch_crawl` - Process multiple URLs simultaneously
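For illustration, a `crawl_url` tool call from an MCP client might look like the following. The `wait_for_js` option is mentioned under Troubleshooting below; treat the argument names as a sketch and consult the tool reference for the full parameter schema:
{
  "name": "crawl_url",
  "arguments": {
    "url": "https://example.com/docs",
    "wait_for_js": true
  }
}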
AI-Powered Analysis
- `intelligent_extract` - Semantic content extraction with custom instructions (example below)
- `auto_summarize` - LLM-based summarization for large content
- `extract_entities` - Pattern-based entity extraction (emails, phones, URLs, etc.)
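A semantic extraction request could look roughly like this; the `instruction` argument name is an assumption based on the "custom instructions" description above, so check the tool reference before relying on it:
{
  "name": "intelligent_extract",
  "arguments": {
    "url": "https://example.com/pricing",
    "instruction": "List the pricing tiers and what each plan includes"
  }
}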
Media Processing
- `process_file` - Convert PDFs, Office docs, ZIP archives to markdown
- `extract_youtube_transcript` - Multi-language transcript extraction (example below)
- `batch_extract_youtube_transcripts` - Process multiple videos
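Transcript extraction follows the same call pattern; the `url` argument name below is assumed for illustration, and VIDEO_ID stands in for a real YouTube video ID:
{
  "name": "extract_youtube_transcript",
  "arguments": {
    "url": "https://www.youtube.com/watch?v=VIDEO_ID"
  }
}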
Search Integration
- `search_google` - Genre-filtered Google search with metadata
- `search_and_crawl` - Combined search and content extraction (example below)
- `batch_search_google` - Multiple search queries with analysis
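A combined search-and-extract request could look roughly like the following; the `query` argument name is an assumption, so check the tool reference for the actual schema:
{
  "name": "search_and_crawl",
  "arguments": {
    "query": "crawl4ai MCP server tutorial"
  }
}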
🎯 Common Use Cases
Content Research:
search_and_crawl → intelligent_extract → structured analysis
Documentation Mining:
deep_crawl_site → batch processing → comprehensive extraction
Media Analysis:
extract_youtube_transcript → auto_summarize → insight generation
Competitive Intelligence:
batch_crawl → extract_entities → comparative analysis
🚨 Quick Troubleshooting
Installation Issues:
- Run system diagnostics with the `get_system_diagnostics` tool
- Re-run the setup scripts with proper privileges
- Try the development installation method

Performance Issues:
- Use `wait_for_js: true` for JavaScript-heavy sites
- Increase the timeout for slow-loading pages
- Enable `auto_summarize` for large content

Configuration Issues:
- Check the JSON syntax in `claude_desktop_config.json` (see the check below)
- Verify that file paths are absolute
- Restart Claude Desktop after configuration changes
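A quick way to catch syntax errors in the config file is Python's built-in JSON checker; adjust the path to wherever your `claude_desktop_config.json` lives:
# Validate the config file; prints the parsed JSON on success or points at the offending line on failure
python -m json.tool claude_desktop_config.json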
🏗️ Project Structure
- Original Library: crawl4ai by unclecode
- MCP Wrapper: This repository (walksoda)
- Implementation: Unofficial third-party integration
📄 License
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.
🤝 Contributing
See the development guide for contribution guidelines and development setup instructions.
🔗 Related Projects
- crawl4ai - The underlying web crawling library
- Model Context Protocol - The standard this server implements
- Claude Desktop - Primary client for MCP servers