Crawl4AI MCP Server

πŸ•·οΈ A lightweight Model Context Protocol (MCP) server that exposes Crawl4AI web scraping and crawling capabilities as tools for AI agents.

Similar to Firecrawl's API but self-hosted and free. Perfect for integrating web scraping into your AI workflows with OpenAI Agents SDK, Cursor, Claude Code, and other MCP-compatible tools.

Features

  • πŸ”§ MCP Tools: Exposes four tools (scrape, crawl, crawl_site, crawl_sitemap) via a stdio MCP server
  • 🌐 Web Scraping: Single-page scraping with markdown extraction
  • πŸ•·οΈ Web Crawling: Multi-page breadth-first crawling with depth control
  • 🧠 Adaptive Crawling: Smart crawling that stops when sufficient content is gathered
  • πŸ›‘οΈ Safety: Blocks internal networks, localhost, and private IPs
  • πŸ“± Agent Ready: Works with OpenAI Agents SDK, Cursor, and Claude Code
  • ⚑ Fast: Powered by Playwright and Crawl4AI's optimized extraction

πŸš€ Quick Start

Choose between Docker (recommended) or manual installation:

Option A: Docker (Recommended) 🐳

Docker eliminates all setup complexity and provides a consistent environment:

Option A1: Use Pre-built Image (Fastest) ⚑
# No setup required! Just pull and run the published image
docker pull uysalsadi/crawl4ai-mcp-server:latest

# Verify the image works (test-config.py is included in this repository)
python test-config.py

# Use directly in MCP configurations (see examples below)
Option A2: Build Yourself
# Clone the repository
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server

# Quick build and test (simplified)
docker build -f Dockerfile.simple -t crawl4ai-mcp .
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | docker run --rm -i crawl4ai-mcp

# Or use helper script (full build with Playwright)
./docker-run.sh build
./docker-run.sh test
./docker-run.sh run

Docker Quick Commands:

  • ./docker-run.sh build - Build the image
  • ./docker-run.sh run - Run MCP server (stdio mode)
  • ./docker-run.sh test - Run smoke tests
  • ./docker-run.sh dev - Development mode with shell access

Option B: Manual Installation

# Clone and setup
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Install Playwright browsers
python -m playwright install chromium

# Test basic functionality
python -m crawler_agent.smoke_client

# Test adaptive crawling
python test_adaptive.py

Use with OpenAI Agents SDK

# Set your OpenAI API key
export OPENAI_API_KEY="your-key-here"

# Docker: Run the example agent
docker-compose run --rm -e OPENAI_API_KEY crawl4ai-mcp python -m crawler_agent.agents_example

# Manual: Run the example agent
python -m crawler_agent.agents_example

πŸ› οΈ Tools Reference

The MCP server exposes 4 production-ready tools. Each can persist content to disk via output_dir and return metadata only, keeping large page content out of the model context:

scrape

Fetch a single URL and return markdown content.

Arguments:

  • url (required): The URL to scrape
  • output_dir (optional): If provided, persists content to disk and returns metadata only
  • crawler: Optional Crawl4AI crawler config overrides
  • browser: Optional Crawl4AI browser config overrides
  • script: Optional C4A-Script for page interaction
  • timeout_sec: Request timeout (default: 45s, max: 600s)

Returns (without output_dir):

{
  "url": "https://example.com",
  "markdown": "# Page Title\n\nContent...",
  "links": ["https://example.com/page1", "..."],
  "metadata": {}
}

Returns (with output_dir):

{
  "run_id": "scrape_20250815_122811_4fb188",
  "file_path": "/path/to/output/pages/example.com_index.md",
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 230
}
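
The file_path in that payload points at the persisted markdown, so a client can read the content back without it passing through the model context. A minimal sketch using the example path above:

from pathlib import Path

# Read back the markdown that scrape persisted to disk; the path
# comes from the tool's return payload ("file_path").
markdown = Path("/path/to/output/pages/example.com_index.md").read_text()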

crawl

Multi-page breadth-first crawling with filtering and adaptive stopping.

Arguments:

  • seed_url (required): Starting URL for the crawl
  • max_depth: Maximum link depth to follow (default: 1, max: 4)
  • max_pages: Maximum pages to crawl (default: 5, max: 100)
  • same_domain_only: Stay within the same domain (default: true)
  • include_patterns: Regex patterns URLs must match
  • exclude_patterns: Regex patterns to exclude URLs
  • adaptive: Enable adaptive crawling (default: false)
  • output_dir (optional): If provided, persists content to disk and returns metadata only
  • crawler, browser, script, timeout_sec: Same as scrape

Returns (without output_dir):

{
  "start_url": "https://example.com",
  "pages": [
    {
      "url": "https://example.com/page1",
      "markdown": "Content...",
      "links": ["..."]
    }
  ],
  "total_pages": 3
}

Returns (with output_dir):

{
  "run_id": "crawl_20250815_122828_464944",
  "pages_ok": 3,
  "pages_failed": 0,
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 690
}

crawl_site

Comprehensive site crawling with persistence (always requires output_dir).

Arguments:

  • entry_url (required): Starting URL for site crawl
  • output_dir (required): Directory to persist results
  • max_depth: Maximum crawl depth (default: 2, max: 6)
  • max_pages: Maximum pages to crawl (default: 200, max: 5000)
  • Additional config options for filtering and performance

Returns:

{
  "run_id": "site_20250815_122851_0e2455",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json",
  "pages_ok": 15,
  "pages_failed": 2,
  "bytes_written": 45672
}

crawl_sitemap

Sitemap-based crawling with persistence (always requires output_dir).

Arguments:

  • sitemap_url (required): URL to sitemap.xml
  • output_dir (required): Directory to persist results
  • max_entries: Maximum sitemap entries to process (default: 1000)
  • Additional config options for filtering and performance

Returns:

{
  "run_id": "sitemap_20250815_123006_667d71",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json", 
  "pages_ok": 25,
  "pages_failed": 0,
  "bytes_written": 123456
}
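
Both persistence-only tools are invoked the same way as scrape and crawl. A minimal sketch, assuming an initialized ClientSession (as set up in the Usage Examples below) and illustrative output directories:

# Site crawl: always persists results under output_dir
result = await session.call_tool("crawl_site", {
    "entry_url": "https://docs.example.com",
    "output_dir": "./crawls/docs-example",
    "max_depth": 2,
    "max_pages": 50
})

# Sitemap crawl: processes entries listed in sitemap.xml
result = await session.call_tool("crawl_sitemap", {
    "sitemap_url": "https://example.com/sitemap.xml",
    "output_dir": "./crawls/sitemap-example",
    "max_entries": 100
})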

πŸ’‘ Usage Examples

Standalone MCP Server

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="python",
    args=["-m", "crawler_agent.mcp_server"]
)

async with stdio_client(params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        
        # Scrape a single page
        result = await session.call_tool("scrape", {
            "url": "https://example.com"
        })
        
        # Crawl with adaptive stopping
        result = await session.call_tool("crawl", {
            "seed_url": "https://docs.example.com",
            "max_pages": 10,
            "adaptive": True
        })
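
Tool results come back as MCP content blocks rather than raw JSON, so the payloads shown in the Tools Reference arrive as text. A minimal way to decode them, assuming the standard mcp Python SDK result shape (a text content block with a .text field):

import json

# call_tool returns a CallToolResult; the tool's JSON payload arrives
# as a text content block, so decode it back into a dict.
payload = json.loads(result.content[0].text)
print(payload["start_url"], payload["total_pages"])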

OpenAI Agents SDK Integration

from agents import Agent, Runner
from agents.mcp.server import MCPServerStdio

async with MCPServerStdio(
    params={
        "command": "python", 
        "args": ["-m", "crawler_agent.mcp_server"]
    },
    cache_tools_list=True
) as server:
    
    agent = Agent(
        name="Research Assistant",
        instructions="Use scrape and crawl tools to research topics.",
        mcp_servers=[server],
        mcp_config={"convert_schemas_to_strict": True}
    )
    
    result = await Runner.run(
        agent, 
        "Research the latest AI safety papers"
    )
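
The returned run result carries the agent's answer; with the Agents SDK this is exposed as final_output:

# Print the agent's final answer from the RunResult
print(result.final_output)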

Cursor/Claude Code Integration

This project supports both Cursor and Claude Code with synchronized configuration:

Cursor Setup
  • Uses .cursorrules for AI assistant behavior and project guidance
  • Automatic MCP tool detection and integration
  • Project-specific rules and workflow guidance
Claude Code Setup
  • Uses CLAUDE.md for comprehensive project context and memory bank
  • Native MCP integration via mcp__crawl4ai-mcp__* tool calls
  • Global config: ~/.claude/claude_desktop_config.json or project config: .mcp.json
Docker Integration for Both Environments

Both Cursor and Claude Code can use the Dockerized MCP server:

// .mcp.json (project-scoped configuration)
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
Dual Environment Features
  • Synchronized Documentation: Changes to documentation are maintained across both environments
  • Shared MCP Configuration: Both use the same MCP server and tool schemas
  • Docker Compatibility: Consistent environment across both Cursor and Claude Code
  • Cross-Compatible: Projects work seamlessly whether using Cursor or Claude Code

🐳 Docker Usage

Quick Start with Docker

The Docker approach eliminates all manual setup and provides a consistent environment:

# 1. Clone and build
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
./docker-run.sh build

# 2. Test the installation
./docker-run.sh test

# 3. Run MCP server
./docker-run.sh run

Docker Commands Reference

  • ./docker-run.sh build - Build the Docker image
  • ./docker-run.sh run - Run MCP server in stdio mode
  • ./docker-run.sh test - Run smoke tests
  • ./docker-run.sh dev - Development mode with shell access
  • ./docker-run.sh stop - Stop running containers
  • ./docker-run.sh clean - Remove containers and images
  • ./docker-run.sh logs - Show container logs

Docker Compose Services

The docker-compose.yml provides multiple service configurations:

  • crawl4ai-mcp: Production MCP server
  • crawl4ai-mcp-dev: Development container with shell access
  • crawl4ai-mcp-test: Test runner for smoke tests

MCP Integration with Docker

Global MCP Configuration (Claude Code)

Add to ~/.claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i", 
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}

βœ… Copy this exact config - it uses the published Docker image!

Global MCP Configuration (Cursor)

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls", 
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}

βœ… This configuration is tested and working!

Project-Scoped Configuration

Add to your project's .mcp.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}

βœ… This project already includes this configuration - see .mcp.json

Configuration Validation

After setting up any MCP configuration:

  1. Test the Docker image works:

    python test-config.py
    
  2. Restart your editor (Cursor/Claude Code) to reload MCP configuration

  3. Verify tools are available:

    • Look for crawl4ai-mcp in the MCP tools panel
    • Should see 4 tools: scrape, crawl, crawl_site, crawl_sitemap

If tools don't appear, check:

  • Docker is running and image is accessible
  • MCP configuration file syntax is valid JSON
  • Editor has been restarted after config changes

Docker Advantages

βœ… Zero Setup: No need for Python venv, pip, or Playwright installation
βœ… Consistent Environment: Same behavior across all platforms
βœ… Isolated Dependencies: No conflicts with your system Python
βœ… Easy Updates: docker pull to get latest version
βœ… Portable: Works anywhere Docker runs
βœ… Volume Persistence: Crawl outputs saved to host filesystem

Environment Variables

Set these when running Docker containers:

# Using docker-compose
OPENAI_API_KEY=your-key docker-compose up crawl4ai-mcp

# Using docker run directly
docker run --rm -i \
  -e OPENAI_API_KEY=your-key \
  -e CRAWL4AI_MCP_LOG=DEBUG \
  -v ./crawls:/app/crawls \
  uysalsadi/crawl4ai-mcp-server:latest

βš™οΈ Configuration

Environment Variables

  • TARGET_URL: Default URL for smoke testing (default: https://modelcontextprotocol.io/docs)
  • RESEARCH_TASK: Custom research task for agents example
  • OPENAI_API_KEY: Required for OpenAI Agents SDK

Safety Settings

The server blocks these URL patterns by default (a sketch of the guard follows the list):

  • localhost, 127.0.0.1, ::1
  • Private IP ranges (RFC 1918)
  • file:// schemes
  • .local, .internal, .lan domains
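
A minimal sketch of this kind of guard (illustrative only; the actual checks live in crawler_agent/safety.py and may differ):

import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_SUFFIXES = (".local", ".internal", ".lan")

def is_url_allowed(url: str) -> bool:
    """Reject non-HTTP schemes, local hostnames, and addresses that
    resolve into private, loopback, or link-local ranges."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # blocks file:// and other non-web schemes
    host = parsed.hostname or ""
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable hosts are rejected, not crawled
    # Covers 127.0.0.1, ::1, and the RFC 1918 private ranges
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)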

πŸš€ Advanced Features

Adaptive Crawling

When adaptive: true is set, the crawler uses a simple content-based stopping strategy:

  • Stops when total content exceeds 5,000 characters (sketched below)
  • Prevents over-crawling for information gathering tasks
  • More sophisticated LLM-based strategies can be added
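
A sketch of that stopping rule (hypothetical names; the real logic lives in crawler_agent/adaptive_strategy.py):

# Illustrative content-budget stop: halt the crawl once the pages
# gathered so far exceed ~5,000 characters of markdown.
MAX_CONTENT_CHARS = 5000

def should_stop(pages: list[dict]) -> bool:
    total = sum(len(page.get("markdown", "")) for page in pages)
    return total >= MAX_CONTENT_CHARS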

Custom Filtering

Use regex patterns to control which URLs are crawled:

await session.call_tool("crawl", {
    "seed_url": "https://docs.example.com",
    "include_patterns": [r"/docs/", r"/api/"],
    "exclude_patterns": [r"/old/", r"\.pdf$"]
})

Browser Configuration

Pass custom browser/crawler settings:

await session.call_tool("scrape", {
    "url": "https://example.com",
    "browser": {"headless": True, "viewport": {"width": 1280, "height": 720}},
    "crawler": {"verbose": True}
})

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   AI Agent      │───▢│   MCP Server     │───▢│   Crawl4AI      β”‚
β”‚ (Cursor/Agents) β”‚    β”‚ (stdio)          β”‚    β”‚ (Playwright)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚  Safety Guards   β”‚
                       β”‚ (URL validation) β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“¦ Publishing to Docker Hub

For maintainers who want to publish updates to the Docker registry:

Publishing Process

# 1. Login to Docker Hub (one time setup)
./docker-push.sh login

# 2. Build, push, and test everything
./docker-push.sh all

# Or do steps individually:
./docker-push.sh build    # Build and tag image
./docker-push.sh push     # Push to Docker Hub
./docker-push.sh test     # Test the published image

Docker Hub Repository

The image is published at: uysalsadi/crawl4ai-mcp-server

  • Latest: uysalsadi/crawl4ai-mcp-server:latest
  • Versioned: uysalsadi/crawl4ai-mcp-server:v1.0.0

Pulling the Published Image

Users can pull and use the image without any local setup:

docker pull uysalsadi/crawl4ai-mcp-server:latest

πŸ”§ Development

Project Structure

crawler_agent/
β”œβ”€β”€ __init__.py
β”œβ”€β”€ mcp_server.py         # Main MCP server
β”œβ”€β”€ safety.py            # URL safety validation
β”œβ”€β”€ adaptive_strategy.py  # Adaptive crawling logic
β”œβ”€β”€ smoke_client.py       # Basic testing client
└── agents_example.py     # OpenAI Agents SDK example

Testing

# Test MCP server
python -m crawler_agent.smoke_client

# Test adaptive crawling
python test_adaptive.py

# Test with agents (requires OPENAI_API_KEY)
python -m crawler_agent.agents_example

Contributing

  1. Environment Setup: Always activate .venv before development work
  2. Documentation: Follow the Documentation Memory Rule - synchronize changes across CLAUDE.md, .cursorrules, and README.md
  3. Code Style: Follow .cursorrules for Cursor or CLAUDE.md for Claude Code
  4. Testing: Use python -m crawler_agent.smoke_client for validation
  5. Safety: Ensure security guards are maintained for all new features
  6. Dual Compatibility: Verify changes work in both Cursor and Claude Code environments

πŸ“„ License

MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

This project uses Crawl4AI by UncleCode for web scraping capabilities. Crawl4AI is an excellent open-source LLM-friendly web crawler and scraper.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

⭐ Support

If this project helped you, please give it a star! It helps others discover the project.

πŸ› Issues

Found a bug or have a feature request? Please open an issue.