Crawl4AI MCP Server
🕷️ A lightweight Model Context Protocol (MCP) server that exposes Crawl4AI web scraping and crawling capabilities as tools for AI agents.
Similar to Firecrawl's API but self-hosted and free. Perfect for integrating web scraping into your AI workflows with OpenAI Agents SDK, Cursor, Claude Code, and other MCP-compatible tools.
Features
- 🔧 MCP Tools: Exposes 4 powerful tools: scrape, crawl, crawl_site, crawl_sitemap via a stdio MCP server
- 🌐 Web Scraping: Single-page scraping with markdown extraction
- 🕷️ Web Crawling: Multi-page breadth-first crawling with depth control
- 🧠 Adaptive Crawling: Smart crawling that stops when sufficient content is gathered
- 🛡️ Safety: Blocks internal networks, localhost, and private IPs
- 📱 Agent Ready: Works with OpenAI Agents SDK, Cursor, and Claude Code
- ⚡ Fast: Powered by Playwright and Crawl4AI's optimized extraction
🚀 Quick Start
Choose between Docker (recommended) or manual installation:
Option A: Docker (Recommended) 🐳
Docker eliminates all setup complexity and provides a consistent environment:
Option A1: Use Pre-built Image (Fastest) ⚡
# No setup required! Just pull and run the published image
docker pull uysalsadi/crawl4ai-mcp-server:latest
# Test it works
python test-config.py
# Use directly in MCP configurations (see examples below)
Option A2: Build Yourself
# Clone the repository
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
# Quick build and test (simplified)
docker build -f Dockerfile.simple -t crawl4ai-mcp .
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | docker run --rm -i crawl4ai-mcp
# Or use helper script (full build with Playwright)
./docker-run.sh build
./docker-run.sh test
./docker-run.sh run
Docker Quick Commands:
- ./docker-run.sh build - Build the image
- ./docker-run.sh run - Run MCP server (stdio mode)
- ./docker-run.sh test - Run smoke tests
- ./docker-run.sh dev - Development mode with shell access
Option B: Manual Installation
# Clone and setup
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Install Playwright browsers
python -m playwright install chromium
# Test basic functionality
python -m crawler_agent.smoke_client
# Test adaptive crawling
python test_adaptive.py
Use with OpenAI Agents SDK
# Set your OpenAI API key
export OPENAI_API_KEY="your-key-here"
# Docker: Run the example agent
docker-compose run --rm -e OPENAI_API_KEY crawl4ai-mcp python -m crawler_agent.agents_example
# Manual: Run the example agent
python -m crawler_agent.agents_example
🛠️ Tools Reference
The MCP server exposes 4 production-ready tools, each able to persist its output to disk via output_dir and return only metadata instead of full page content:
scrape
Fetch a single URL and return markdown content.
Arguments:
- url (required): The URL to scrape
- output_dir (optional): If provided, persists content to disk and returns metadata only
- crawler: Optional Crawl4AI crawler config overrides
- browser: Optional Crawl4AI browser config overrides
- script: Optional C4A-Script for page interaction
- timeout_sec: Request timeout (default: 45s, max: 600s)
Returns (without output_dir):
{
"url": "https://example.com",
"markdown": "# Page Title\n\nContent...",
"links": ["https://example.com/page1", "..."],
"metadata": {}
}
Returns (with output_dir):
{
"run_id": "scrape_20250815_122811_4fb188",
"file_path": "/path/to/output/pages/example.com_index.md",
"manifest_path": "/path/to/output/manifest.json",
"bytes_written": 230
}
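For example, to persist a page instead of returning its markdown inline, a minimal sketch, assuming an already-initialized MCP `session` as shown in the Usage Examples below (the output path is illustrative):
```python
# Sketch: scrape with persistence - content is written to disk and only
# metadata comes back. Assumes `session` is an initialized ClientSession.
result = await session.call_tool("scrape", {
    "url": "https://example.com",
    "output_dir": "./crawls/example",  # illustrative path
    "timeout_sec": 60,
})
# Expect run_id, file_path, manifest_path, and bytes_written in the result.
```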
crawl
Multi-page breadth-first crawling with filtering and adaptive stopping.
Arguments:
- seed_url (required): Starting URL for the crawl
- max_depth: Maximum link depth to follow (default: 1, max: 4)
- max_pages: Maximum pages to crawl (default: 5, max: 100)
- same_domain_only: Stay within the same domain (default: true)
- include_patterns: Regex patterns URLs must match
- exclude_patterns: Regex patterns to exclude URLs
- adaptive: Enable adaptive crawling (default: false)
- output_dir (optional): If provided, persists content to disk and returns metadata only
- crawler, browser, script, timeout_sec: Same as scrape
Returns (without output_dir):
{
"start_url": "https://example.com",
"pages": [
{
"url": "https://example.com/page1",
"markdown": "Content...",
"links": ["..."]
}
],
"total_pages": 3
}
Returns (with output_dir):
{
"run_id": "crawl_20250815_122828_464944",
"pages_ok": 3,
"pages_failed": 0,
"manifest_path": "/path/to/output/manifest.json",
"bytes_written": 690
}
crawl_site
Comprehensive site crawling with persistence (always requires output_dir).
Arguments:
- entry_url (required): Starting URL for the site crawl
- output_dir (required): Directory to persist results
- max_depth: Maximum crawl depth (default: 2, max: 6)
- max_pages: Maximum pages to crawl (default: 200, max: 5000)
- Additional config options for filtering and performance
Returns:
{
"run_id": "site_20250815_122851_0e2455",
"output_dir": "/path/to/output",
"manifest_path": "/path/to/output/manifest.json",
"pages_ok": 15,
"pages_failed": 2,
"bytes_written": 45672
}
crawl_sitemap
Sitemap-based crawling with persistence (always requires output_dir).
Arguments:
- sitemap_url (required): URL to sitemap.xml
- output_dir (required): Directory to persist results
- max_entries: Maximum sitemap entries to process (default: 1000)
- Additional config options for filtering and performance
Returns:
{
"run_id": "sitemap_20250815_123006_667d71",
"output_dir": "/path/to/output",
"manifest_path": "/path/to/output/manifest.json",
"pages_ok": 25,
"pages_failed": 0,
"bytes_written": 123456
}
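Both persistence tools are called the same way as scrape and crawl. A minimal sketch, assuming an initialized MCP `session` (see the Usage Examples below) and illustrative URLs and paths:
```python
# Sketch: crawl_site and crawl_sitemap always write to output_dir and
# return run metadata (run_id, manifest_path, pages_ok, ...).
site_result = await session.call_tool("crawl_site", {
    "entry_url": "https://docs.example.com",
    "output_dir": "./crawls/docs-site",    # illustrative path
    "max_depth": 2,
    "max_pages": 50,
})

sitemap_result = await session.call_tool("crawl_sitemap", {
    "sitemap_url": "https://example.com/sitemap.xml",
    "output_dir": "./crawls/sitemap-run",  # illustrative path
    "max_entries": 100,
})
```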
💡 Usage Examples
Standalone MCP Server
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
params = StdioServerParameters(
command="python",
args=["-m", "crawler_agent.mcp_server"]
)
async with stdio_client(params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# Scrape a single page
result = await session.call_tool("scrape", {
"url": "https://example.com"
})
# Crawl with adaptive stopping
result = await session.call_tool("crawl", {
"seed_url": "https://docs.example.com",
"max_pages": 10,
"adaptive": True
})
OpenAI Agents SDK Integration
from agents import Agent, Runner
from agents.mcp.server import MCPServerStdio
async with MCPServerStdio(
params={
"command": "python",
"args": ["-m", "crawler_agent.mcp_server"]
},
cache_tools_list=True
) as server:
agent = Agent(
name="Research Assistant",
instructions="Use scrape and crawl tools to research topics.",
mcp_servers=[server],
mcp_config={"convert_schemas_to_strict": True}
)
result = await Runner.run(
agent,
"Research the latest AI safety papers"
)
Cursor/Claude Code Integration
This project supports both Cursor and Claude Code with synchronized configuration:
Cursor Setup
- Uses .cursorrules for AI assistant behavior and project guidance
- Automatic MCP tool detection and integration
- Project-specific rules and workflow guidance
Claude Code Setup
- Uses CLAUDE.md for comprehensive project context and memory bank
- Native MCP integration via mcp__crawl4ai-mcp__* tool calls
- Global config: ~/.claude/claude_desktop_config.json or project config: .mcp.json
Docker Integration for Both Environments
Both Cursor and Claude Code can use the Dockerized MCP server:
// .mcp.json (project-scoped configuration)
{
"mcpServers": {
"crawl4ai-mcp": {
"command": "docker",
"args": [
"run", "--rm", "-i",
"--volume", "./crawls:/app/crawls",
"uysalsadi/crawl4ai-mcp-server:latest"
],
"env": {
"CRAWL4AI_MCP_LOG": "INFO"
}
}
}
}
Dual Environment Features
- Synchronized Documentation: Changes to documentation are maintained across both environments
- Shared MCP Configuration: Both use the same MCP server and tool schemas
- Docker Compatibility: Consistent environment across both Cursor and Claude Code
- Cross-Compatible: Projects work seamlessly whether using Cursor or Claude Code
🐳 Docker Usage
Quick Start with Docker
The Docker approach eliminates all manual setup and provides a consistent environment:
# 1. Clone and build
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
./docker-run.sh build
# 2. Test the installation
./docker-run.sh test
# 3. Run MCP server
./docker-run.sh run
Docker Commands Reference
| Command | Description |
|---|---|
| ./docker-run.sh build | Build the Docker image |
| ./docker-run.sh run | Run MCP server in stdio mode |
| ./docker-run.sh test | Run smoke tests |
| ./docker-run.sh dev | Development mode with shell access |
| ./docker-run.sh stop | Stop running containers |
| ./docker-run.sh clean | Remove containers and images |
| ./docker-run.sh logs | Show container logs |
Docker Compose Services
The docker-compose.yml provides multiple service configurations:
- crawl4ai-mcp: Production MCP server
- crawl4ai-mcp-dev: Development container with shell access
- crawl4ai-mcp-test: Test runner for smoke tests
MCP Integration with Docker
Global MCP Configuration (Claude Code)
Add to ~/.claude/claude_desktop_config.json:
{
"mcpServers": {
"crawl4ai-mcp": {
"command": "docker",
"args": [
"run", "--rm", "-i",
"--volume", "/tmp/crawl4ai-crawls:/app/crawls",
"uysalsadi/crawl4ai-mcp-server:latest"
],
"env": {
"CRAWL4AI_MCP_LOG": "INFO"
}
}
}
}
✅ Copy this exact config - it uses the published Docker image!
Global MCP Configuration (Cursor)
Add to ~/.cursor/mcp.json:
{
"mcpServers": {
"crawl4ai-mcp": {
"command": "docker",
"args": [
"run", "--rm", "-i",
"--volume", "/tmp/crawl4ai-crawls:/app/crawls",
"uysalsadi/crawl4ai-mcp-server:latest"
],
"env": {
"CRAWL4AI_MCP_LOG": "INFO"
}
}
}
}
✅ This configuration is tested and working!
Project-Scoped Configuration
Add to your project's .mcp.json:
{
"mcpServers": {
"crawl4ai-mcp": {
"command": "docker",
"args": [
"run", "--rm", "-i",
"--volume", "./crawls:/app/crawls",
"uysalsadi/crawl4ai-mcp-server:latest"
],
"env": {
"CRAWL4AI_MCP_LOG": "INFO"
}
}
}
}
✅ This project already includes this configuration - see .mcp.json
Configuration Validation
After setting up any MCP configuration:
- Test the Docker image works: python test-config.py
- Restart your editor (Cursor/Claude Code) to reload MCP configuration
- Verify tools are available:
  - Look for crawl4ai-mcp in the MCP tools panel
  - Should see 4 tools: scrape, crawl, crawl_site, crawl_sitemap
If tools don't appear, check:
- Docker is running and image is accessible
- MCP configuration file syntax is valid JSON
- Editor has been restarted after config changes
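You can also verify the advertised tool list programmatically with a short MCP client script. This is a sketch mirroring the Standalone MCP Server example; for the Docker setup, swap the command and args for the docker run invocation from your config:
```python
# Sketch: connect over stdio and print the advertised tool names.
# Should print scrape, crawl, crawl_site, crawl_sitemap.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="python", args=["-m", "crawler_agent.mcp_server"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```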
Docker Advantages
✅ Zero Setup: No need for Python venv, pip, or Playwright installation
✅ Consistent Environment: Same behavior across all platforms
✅ Isolated Dependencies: No conflicts with your system Python
✅ Easy Updates: docker pull to get latest version
✅ Portable: Works anywhere Docker runs
✅ Volume Persistence: Crawl outputs saved to host filesystem
Environment Variables
Set these when running Docker containers:
# Using docker-compose
OPENAI_API_KEY=your-key docker-compose up crawl4ai-mcp
# Using docker run directly
docker run --rm -i \
-e OPENAI_API_KEY=your-key \
-e CRAWL4AI_MCP_LOG=DEBUG \
-v ./crawls:/app/crawls \
uysalsadi/crawl4ai-mcp-server:latest
⚙️ Configuration
Environment Variables
- TARGET_URL: Default URL for smoke testing (default: https://modelcontextprotocol.io/docs)
- RESEARCH_TASK: Custom research task for the agents example
- OPENAI_API_KEY: Required for the OpenAI Agents SDK
Safety Settings
The server blocks these URL patterns by default:
- localhost, 127.0.0.1, ::1
- Private IP ranges (RFC 1918)
- file:// schemes
- .local, .internal, .lan domains
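The checks live in crawler_agent/safety.py; the sketch below only illustrates the kind of validation listed above (the helper name and details are hypothetical, not the project's actual API):
```python
# Illustrative only - not the actual implementation in crawler_agent/safety.py.
import ipaddress
from urllib.parse import urlparse

BLOCKED_HOSTS = {"localhost"}
BLOCKED_SUFFIXES = (".local", ".internal", ".lan")

def is_url_allowed(url: str) -> bool:  # hypothetical helper name
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):  # rejects file:// and similar
        return False
    host = (parsed.hostname or "").lower()
    if host in BLOCKED_HOSTS or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
        if ip.is_loopback or ip.is_private:  # 127.0.0.1, ::1, RFC 1918 ranges
            return False
    except ValueError:
        pass  # not a literal IP; DNS-resolution checks are out of scope here
    return True
```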
🚀 Advanced Features
Adaptive Crawling
When adaptive: true is set, the crawler uses a simple content-based stopping strategy:
- Stops when total content exceeds 5,000 characters
- Prevents over-crawling for information gathering tasks
- More sophisticated LLM-based strategies can be added
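As a rough illustration of that content-threshold rule (a sketch only; the real logic lives in crawler_agent/adaptive_strategy.py and may differ):
```python
# Sketch of a content-based stopping rule - the threshold matches the
# description above, but the structure is illustrative, not the exact code.
CONTENT_THRESHOLD = 5_000  # characters

def should_stop_crawling(pages: list[dict]) -> bool:  # hypothetical helper
    total_chars = sum(len(page.get("markdown", "")) for page in pages)
    return total_chars >= CONTENT_THRESHOLD
```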
Custom Filtering
Use regex patterns to control which URLs are crawled:
await session.call_tool("crawl", {
"seed_url": "https://docs.example.com",
"include_patterns": [r"/docs/", r"/api/"],
"exclude_patterns": [r"/old/", r"\.pdf$"]
})
Browser Configuration
Pass custom browser/crawler settings:
await session.call_tool("scrape", {
"url": "https://example.com",
"browser": {"headless": True, "viewport": {"width": 1280, "height": 720}},
"crawler": {"verbose": True}
})
🏗️ Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    AI Agent     │───▶│    MCP Server    │───▶│    Crawl4AI     │
│ (Cursor/Agents) │    │     (stdio)      │    │  (Playwright)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                 │
                                 ▼
                       ┌──────────────────┐
                       │  Safety Guards   │
                       │ (URL validation) │
                       └──────────────────┘
📦 Publishing to Docker Hub
For maintainers who want to publish updates to the Docker registry:
Publishing Process
# 1. Login to Docker Hub (one time setup)
./docker-push.sh login
# 2. Build, push, and test everything
./docker-push.sh all
# Or do steps individually:
./docker-push.sh build # Build and tag image
./docker-push.sh push # Push to Docker Hub
./docker-push.sh test # Test the published image
Docker Hub Repository
The image is published at: uysalsadi/crawl4ai-mcp-server
- Latest: uysalsadi/crawl4ai-mcp-server:latest
- Versioned: uysalsadi/crawl4ai-mcp-server:v1.0.0
Usage Statistics
Users can pull and use the image without any local setup:
docker pull uysalsadi/crawl4ai-mcp-server:latest
🔧 Development
Project Structure
crawler_agent/
├── __init__.py
├── mcp_server.py # Main MCP server
├── safety.py # URL safety validation
├── adaptive_strategy.py # Adaptive crawling logic
├── smoke_client.py # Basic testing client
└── agents_example.py # OpenAI Agents SDK example
Testing
# Test MCP server
python -m crawler_agent.smoke_client
# Test adaptive crawling
python test_adaptive.py
# Test with agents (requires OPENAI_API_KEY)
python -m crawler_agent.agents_example
Contributing
- Environment Setup: Always activate .venv before development work
- Documentation: Follow the Documentation Memory Rule - synchronize changes across CLAUDE.md, .cursorrules, and README.md
- Code Style: Follow .cursorrules for Cursor or CLAUDE.md for Claude Code
- Testing: Use python -m crawler_agent.smoke_client for validation
- Safety: Ensure security guards are maintained for all new features
- Dual Compatibility: Verify changes work in both Cursor and Claude Code environments
📄 License
MIT License - see the repository's license file for details.
🙏 Acknowledgments
This project uses Crawl4AI by UncleCode for web scraping capabilities. Crawl4AI is an excellent open-source LLM-friendly web crawler and scraper.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
⭐ Support
If this project helped you, please give it a star! It helps others discover the project.
🐛 Issues
Found a bug or have a feature request? Please open an issue.