Crawl4AI MCP Server
A lightweight Model Context Protocol (MCP) server that exposes Crawl4AI web scraping and crawling capabilities as tools for AI agents.
Similar to Firecrawl's API but self-hosted and free. Perfect for integrating web scraping into your AI workflows with OpenAI Agents SDK, Cursor, Claude Code, and other MCP-compatible tools.
Features
- MCP Tools: Exposes 4 tools (`scrape`, `crawl`, `crawl_site`, `crawl_sitemap`) via a stdio MCP server
- Web Scraping: Single-page scraping with markdown extraction
- Web Crawling: Multi-page breadth-first crawling with depth control
- Adaptive Crawling: Smart crawling that stops when sufficient content is gathered
- Safety: Blocks internal networks, localhost, and private IPs
- Agent Ready: Works with OpenAI Agents SDK, Cursor, and Claude Code
- Fast: Powered by Playwright and Crawl4AI's optimized extraction
Quick Start
Choose between Docker (recommended) or manual installation:
Option A: Docker (Recommended)
Docker eliminates all setup complexity and provides a consistent environment:
Option A1: Use Pre-built Image (Fastest)
# No setup required! Just pull and run the published image
docker pull uysalsadi/crawl4ai-mcp-server:latest
# Test it works
python test-config.py
# Use directly in MCP configurations (see examples below)
Option A2: Build Yourself
# Clone the repository
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
# Quick build and test (simplified)
docker build -f Dockerfile.simple -t crawl4ai-mcp .
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | docker run --rm -i crawl4ai-mcp
# Or use helper script (full build with Playwright)
./docker-run.sh build
./docker-run.sh test
./docker-run.sh run
Docker Quick Commands:
- `./docker-run.sh build` - Build the image
- `./docker-run.sh run` - Run the MCP server (stdio mode)
- `./docker-run.sh test` - Run smoke tests
- `./docker-run.sh dev` - Development mode with shell access
Option B: Manual Installation
# Clone and setup
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Install Playwright browsers
python -m playwright install chromium
# Test basic functionality
python -m crawler_agent.smoke_client
# Test adaptive crawling
python test_adaptive.py
Use with OpenAI Agents SDK
# Set your OpenAI API key
export OPENAI_API_KEY="your-key-here"
# Docker: Run the example agent
docker-compose run --rm -e OPENAI_API_KEY crawl4ai-mcp python -m crawler_agent.agents_example
# Manual: Run the example agent
python -m crawler_agent.agents_example
Tools Reference
The MCP server exposes 4 production-ready tools. Each can persist content to disk via `output_dir` and return only metadata, keeping large page content out of tool responses:
scrape
Fetch a single URL and return markdown content.
Arguments:
- `url` (required): The URL to scrape
- `output_dir` (optional): If provided, persists content to disk and returns metadata only
- `crawler`: Optional Crawl4AI crawler config overrides
- `browser`: Optional Crawl4AI browser config overrides
- `script`: Optional C4A-Script for page interaction
- `timeout_sec`: Request timeout (default: 45s, max: 600s)
Returns (without output_dir):
{
  "url": "https://example.com",
  "markdown": "# Page Title\n\nContent...",
  "links": ["https://example.com/page1", "..."],
  "metadata": {}
}
Returns (with output_dir):
{
  "run_id": "scrape_20250815_122811_4fb188",
  "file_path": "/path/to/output/pages/example.com_index.md",
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 230
}
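For example, an MCP client can ask `scrape` to persist results instead of returning them inline. This is a minimal sketch assuming an already-initialized `session` (set up as shown under Usage Examples below); the output path is illustrative.
# Persist the scraped markdown to disk; the response carries only run metadata
result = await session.call_tool("scrape", {
    "url": "https://example.com",
    "output_dir": "./crawls/example",  # illustrative path on the host
    "timeout_sec": 60
})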
crawl
Multi-page breadth-first crawling with filtering and adaptive stopping.
Arguments:
- `seed_url` (required): Starting URL for the crawl
- `max_depth`: Maximum link depth to follow (default: 1, max: 4)
- `max_pages`: Maximum pages to crawl (default: 5, max: 100)
- `same_domain_only`: Stay within the same domain (default: true)
- `include_patterns`: Regex patterns URLs must match
- `exclude_patterns`: Regex patterns to exclude URLs
- `adaptive`: Enable adaptive crawling (default: false)
- `output_dir` (optional): If provided, persists content to disk and returns metadata only
- `crawler`, `browser`, `script`, `timeout_sec`: Same as `scrape`
Returns (without output_dir):
{
  "start_url": "https://example.com",
  "pages": [
    {
      "url": "https://example.com/page1",
      "markdown": "Content...",
      "links": ["..."]
    }
  ],
  "total_pages": 3
}
Returns (with output_dir):
{
  "run_id": "crawl_20250815_122828_464944",
  "pages_ok": 3,
  "pages_failed": 0,
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 690
}
crawl_site
Comprehensive site crawling with persistence (always requires output_dir).
Arguments:
- `entry_url` (required): Starting URL for site crawl
- `output_dir` (required): Directory to persist results
- `max_depth`: Maximum crawl depth (default: 2, max: 6)
- `max_pages`: Maximum pages to crawl (default: 200, max: 5000)
- Additional config options for filtering and performance
Returns:
{
  "run_id": "site_20250815_122851_0e2455",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json",
  "pages_ok": 15,
  "pages_failed": 2,
  "bytes_written": 45672
}
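As a sketch of how this tool might be called from an MCP client (the URL and output path are illustrative; `session` is set up as shown under Usage Examples below):
result = await session.call_tool("crawl_site", {
    "entry_url": "https://docs.example.com",
    "output_dir": "./crawls/docs-site",  # required: pages and manifest are written here
    "max_depth": 2,
    "max_pages": 50
})
# result carries run_id, manifest_path, pages_ok, pages_failed, bytes_written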
crawl_sitemap
Sitemap-based crawling with persistence (always requires output_dir).
Arguments:
- `sitemap_url` (required): URL to sitemap.xml
- `output_dir` (required): Directory to persist results
- `max_entries`: Maximum sitemap entries to process (default: 1000)
- Additional config options for filtering and performance
Returns:
{
  "run_id": "sitemap_20250815_123006_667d71",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json",
  "pages_ok": 25,
  "pages_failed": 0,
  "bytes_written": 123456
}
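A similar sketch for sitemap-driven crawling (the sitemap URL and output path are illustrative, reusing the same `session`):
result = await session.call_tool("crawl_sitemap", {
    "sitemap_url": "https://example.com/sitemap.xml",
    "output_dir": "./crawls/sitemap-run",  # required: results are persisted here
    "max_entries": 100
})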
Usage Examples
Standalone MCP Server
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
params = StdioServerParameters(
    command="python",
    args=["-m", "crawler_agent.mcp_server"]
)

async with stdio_client(params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()

        # Scrape a single page
        result = await session.call_tool("scrape", {
            "url": "https://example.com"
        })

        # Crawl with adaptive stopping
        result = await session.call_tool("crawl", {
            "seed_url": "https://docs.example.com",
            "max_pages": 10,
            "adaptive": True
        })
OpenAI Agents SDK Integration
from agents import Agent, Runner
from agents.mcp.server import MCPServerStdio
async with MCPServerStdio(
    params={
        "command": "python",
        "args": ["-m", "crawler_agent.mcp_server"]
    },
    cache_tools_list=True
) as server:
    agent = Agent(
        name="Research Assistant",
        instructions="Use scrape and crawl tools to research topics.",
        mcp_servers=[server],
        mcp_config={"convert_schemas_to_strict": True}
    )
    result = await Runner.run(
        agent,
        "Research the latest AI safety papers"
    )
Cursor/Claude Code Integration
This project supports both Cursor and Claude Code with synchronized configuration:
Cursor Setup
- Uses `.cursorrules` for AI assistant behavior and project guidance
- Automatic MCP tool detection and integration
- Project-specific rules and workflow guidance
Claude Code Setup
- Uses `CLAUDE.md` for comprehensive project context and memory bank
- Native MCP integration via `mcp__crawl4ai-mcp__*` tool calls
- Global config: `~/.claude/claude_desktop_config.json` or project config: `.mcp.json`
Docker Integration for Both Environments
Both Cursor and Claude Code can use the Dockerized MCP server:
// .mcp.json (project-scoped configuration)
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
Dual Environment Features
- Synchronized Documentation: Changes to documentation are maintained across both environments
- Shared MCP Configuration: Both use the same MCP server and tool schemas
- Docker Compatibility: Consistent environment across both Cursor and Claude Code
- Cross-Compatible: Projects work seamlessly whether using Cursor or Claude Code
Docker Usage
Quick Start with Docker
The Docker approach eliminates all manual setup and provides a consistent environment:
# 1. Clone and build
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
./docker-run.sh build
# 2. Test the installation
./docker-run.sh test
# 3. Run MCP server
./docker-run.sh run
Docker Commands Reference
| Command | Description |
|---|---|
| `./docker-run.sh build` | Build the Docker image |
| `./docker-run.sh run` | Run MCP server in stdio mode |
| `./docker-run.sh test` | Run smoke tests |
| `./docker-run.sh dev` | Development mode with shell access |
| `./docker-run.sh stop` | Stop running containers |
| `./docker-run.sh clean` | Remove containers and images |
| `./docker-run.sh logs` | Show container logs |
Docker Compose Services
The `docker-compose.yml` provides multiple service configurations:
- `crawl4ai-mcp`: Production MCP server
- `crawl4ai-mcp-dev`: Development container with shell access
- `crawl4ai-mcp-test`: Test runner for smoke tests
MCP Integration with Docker
Global MCP Configuration (Claude Code)
Add to `~/.claude/claude_desktop_config.json`:
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
Copy this exact config - it uses the published Docker image.
Global MCP Configuration (Cursor)
Add to `~/.cursor/mcp.json`:
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
This configuration is tested and working.
Project-Scoped Configuration
Add to your project's `.mcp.json`:
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
This project already includes this configuration - see `.mcp.json`.
Configuration Validation
After setting up any MCP configuration:
- Test that the Docker image works: `python test-config.py`
- Restart your editor (Cursor/Claude Code) to reload the MCP configuration
- Verify the tools are available:
  - Look for `crawl4ai-mcp` in the MCP tools panel
  - You should see 4 tools: `scrape`, `crawl`, `crawl_site`, `crawl_sitemap`
If tools don't appear, check:
- Docker is running and image is accessible
- MCP configuration file syntax is valid JSON (a quick check follows this list)
- Editor has been restarted after config changes
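One quick, editor-independent way to confirm the JSON syntax point is to parse the config with Python's standard library (the path assumes a project-scoped .mcp.json):
import json
from pathlib import Path

# Raises a JSONDecodeError on invalid syntax; otherwise lists the configured servers
config = json.loads(Path(".mcp.json").read_text())
print(sorted(config.get("mcpServers", {})))  # expect ['crawl4ai-mcp']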
Docker Advantages
- Zero Setup: No need for Python venv, pip, or Playwright installation
- Consistent Environment: Same behavior across all platforms
- Isolated Dependencies: No conflicts with your system Python
- Easy Updates: `docker pull` to get the latest version
- Portable: Works anywhere Docker runs
- Volume Persistence: Crawl outputs saved to the host filesystem
Environment Variables
Set these when running Docker containers:
# Using docker-compose
OPENAI_API_KEY=your-key docker-compose up crawl4ai-mcp
# Using docker run directly
docker run --rm -i \
-e OPENAI_API_KEY=your-key \
-e CRAWL4AI_MCP_LOG=DEBUG \
-v ./crawls:/app/crawls \
crawl4ai-mcp-server:latest
Configuration
Environment Variables
- `TARGET_URL`: Default URL for smoke testing (default: https://modelcontextprotocol.io/docs)
- `RESEARCH_TASK`: Custom research task for the agents example
- `OPENAI_API_KEY`: Required for the OpenAI Agents SDK
Safety Settings
The server blocks these URL patterns by default (a sketch of such a guard follows the list):
- `localhost`, `127.0.0.1`, `::1`
- Private IP ranges (RFC 1918)
- `file://` schemes
- `.local`, `.internal`, `.lan` domains
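A minimal sketch of the kind of guard behind this blocklist (illustrative only; the actual logic lives in crawler_agent/safety.py and may differ):
import ipaddress
from urllib.parse import urlparse

BLOCKED_HOSTS = {"localhost"}
BLOCKED_SUFFIXES = (".local", ".internal", ".lan")

def is_url_allowed(url: str) -> bool:
    """Reject file:// schemes, loopback/private IPs, and internal domains."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        return False
    host = (parsed.hostname or "").lower()
    if host in BLOCKED_HOSTS or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
        if ip.is_loopback or ip.is_private:  # 127.0.0.1, ::1, RFC 1918 ranges
            return False
    except ValueError:
        pass  # host is a name, not an IP literal
    return True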
Advanced Features
Adaptive Crawling
When `adaptive: true` is set, the crawler uses a simple content-based stopping strategy (a sketch follows the list):
- Stops when total content exceeds 5,000 characters
- Prevents over-crawling for information gathering tasks
- More sophisticated LLM-based strategies can be added
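A minimal sketch of this heuristic, assuming a simple character budget and page loop (the real implementation is in crawler_agent/adaptive_strategy.py and may differ):
MAX_CONTENT_CHARS = 5_000  # threshold from the rule above

def should_stop(collected_markdown: list[str]) -> bool:
    # Content-based stopping: enough text has been gathered for the task
    return sum(len(md) for md in collected_markdown) >= MAX_CONTENT_CHARS

# Inside a breadth-first crawl loop (illustrative):
# if adaptive and should_stop(pages_markdown):
#     break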
Custom Filtering
Use regex patterns to control which URLs are crawled:
await session.call_tool("crawl", {
    "seed_url": "https://docs.example.com",
    "include_patterns": [r"/docs/", r"/api/"],
    "exclude_patterns": [r"/old/", r"\.pdf$"]
})
Browser Configuration
Pass custom browser/crawler settings:
await session.call_tool("scrape", {
    "url": "https://example.com",
    "browser": {"headless": True, "viewport": {"width": 1280, "height": 720}},
    "crawler": {"verbose": True}
})
Architecture
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│     AI Agent     │────▶│    MCP Server    │────▶│     Crawl4AI     │
│ (Cursor/Agents)  │     │  (stdio/stdio)   │     │   (Playwright)   │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                   │
                                   ▼
                         ┌──────────────────┐
                         │  Safety Guards   │
                         │ (URL validation) │
                         └──────────────────┘
Publishing to Docker Hub
For maintainers who want to publish updates to the Docker registry:
Publishing Process
# 1. Login to Docker Hub (one time setup)
./docker-push.sh login
# 2. Build, push, and test everything
./docker-push.sh all
# Or do steps individually:
./docker-push.sh build # Build and tag image
./docker-push.sh push # Push to Docker Hub
./docker-push.sh test # Test the published image
Docker Hub Repository
The image is published at: uysalsadi/crawl4ai-mcp-server
- Latest: `uysalsadi/crawl4ai-mcp-server:latest`
- Versioned: `uysalsadi/crawl4ai-mcp-server:v1.0.0`
Usage Statistics
Users can pull and use the image without any local setup:
docker pull uysalsadi/crawl4ai-mcp-server:latest
Development
Project Structure
crawler_agent/
├── __init__.py
├── mcp_server.py          # Main MCP server
├── safety.py              # URL safety validation
├── adaptive_strategy.py   # Adaptive crawling logic
├── smoke_client.py        # Basic testing client
└── agents_example.py      # OpenAI Agents SDK example
Testing
# Test MCP server
python -m crawler_agent.smoke_client
# Test adaptive crawling
python test_adaptive.py
# Test with agents (requires OPENAI_API_KEY)
python -m crawler_agent.agents_example
Contributing
- Environment Setup: Always activate `.venv` before development work
- Documentation: Follow the Documentation Memory Rule - synchronize changes across CLAUDE.md, .cursorrules, and README.md
- Code Style: Follow `.cursorrules` for Cursor or `CLAUDE.md` for Claude Code
- Testing: Use `python -m crawler_agent.smoke_client` for validation
- Safety: Ensure security guards are maintained for all new features
- Dual Compatibility: Verify changes work in both Cursor and Claude Code environments
License
MIT License - see the license file for details.
Acknowledgments
This project uses Crawl4AI by UncleCode for web scraping capabilities. Crawl4AI is an excellent open-source LLM-friendly web crawler and scraper.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Support
If this project helped you, please give it a star! It helps others discover the project.
Issues
Found a bug or have a feature request? Please open an issue.