Crawl4AI MCP Server

πŸ•·οΈ A lightweight Model Context Protocol (MCP) server that exposes Crawl4AI web scraping and crawling capabilities as tools for AI agents.

Similar to Firecrawl's API but self-hosted and free. Perfect for integrating web scraping into your AI workflows with OpenAI Agents SDK, Cursor, Claude Code, and other MCP-compatible tools.

Features

  • πŸ”§ MCP Tools: Exposes four tools (scrape, crawl, crawl_site, crawl_sitemap) via a stdio MCP server
  • 🌐 Web Scraping: Single-page scraping with markdown extraction
  • πŸ•·οΈ Web Crawling: Multi-page breadth-first crawling with depth control
  • 🧠 Adaptive Crawling: Smart crawling that stops when sufficient content is gathered
  • πŸ›‘οΈ Safety: Blocks internal networks, localhost, and private IPs
  • πŸ“± Agent Ready: Works with OpenAI Agents SDK, Cursor, and Claude Code
  • ⚑ Fast: Powered by Playwright and Crawl4AI's optimized extraction

πŸš€ Quick Start

Choose between Docker (recommended) or manual installation:

Option A: Docker (Recommended) 🐳

Docker eliminates all setup complexity and provides a consistent environment:

Option A1: Use Pre-built Image (Fastest) ⚑
# No setup required! Just pull and run the published image
docker pull uysalsadi/crawl4ai-mcp-server:latest

# Verify the image works (test-config.py is included in this repository)
python test-config.py

# Use directly in MCP configurations (see examples below)
Option A2: Build Yourself
# Clone the repository
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server

# Quick build and test (simplified)
docker build -f Dockerfile.simple -t crawl4ai-mcp .
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}' | docker run --rm -i crawl4ai-mcp

# Or use helper script (full build with Playwright)
./docker-run.sh build
./docker-run.sh test
./docker-run.sh run

Docker Quick Commands:

  • ./docker-run.sh build - Build the image
  • ./docker-run.sh run - Run MCP server (stdio mode)
  • ./docker-run.sh test - Run smoke tests
  • ./docker-run.sh dev - Development mode with shell access

Option B: Manual Installation

# Clone and setup
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Install Playwright browsers
python -m playwright install chromium

# Test basic functionality
python -m crawler_agent.smoke_client

# Test adaptive crawling
python test_adaptive.py

Use with OpenAI Agents SDK

# Set your OpenAI API key
export OPENAI_API_KEY="your-key-here"

# Docker: Run the example agent
docker-compose run --rm -e OPENAI_API_KEY crawl4ai-mcp python -m crawler_agent.agents_example

# Manual: Run the example agent
python -m crawler_agent.agents_example

πŸ› οΈ Tools Reference

The MCP server exposes 4 production-ready tools. Each can persist content to disk via output_dir and return metadata only, keeping large page content out of the model context:

scrape

Fetch a single URL and return markdown content.

Arguments:

  • url (required): The URL to scrape
  • output_dir (optional): If provided, persists content to disk and returns metadata only
  • crawler: Optional Crawl4AI crawler config overrides
  • browser: Optional Crawl4AI browser config overrides
  • script: Optional C4A-Script for page interaction
  • timeout_sec: Request timeout (default: 45s, max: 600s)

Returns (without output_dir):

{
  "url": "https://example.com",
  "markdown": "# Page Title\n\nContent...",
  "links": ["https://example.com/page1", "..."],
  "metadata": {}
}

Returns (with output_dir):

{
  "run_id": "scrape_20250815_122811_4fb188",
  "file_path": "/path/to/output/pages/example.com_index.md",
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 230
}
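
The file_path in that payload points at the persisted markdown, so a client can read the content back without it passing through the model context. A minimal sketch using the example path above:

from pathlib import Path

# Read back the markdown that scrape persisted to disk; the path
# comes from the tool's return payload ("file_path").
markdown = Path("/path/to/output/pages/example.com_index.md").read_text()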

crawl

Multi-page breadth-first crawling with filtering and adaptive stopping.

Arguments:

  • seed_url (required): Starting URL for the crawl
  • max_depth: Maximum link depth to follow (default: 1, max: 4)
  • max_pages: Maximum pages to crawl (default: 5, max: 100)
  • same_domain_only: Stay within the same domain (default: true)
  • include_patterns: Regex patterns URLs must match
  • exclude_patterns: Regex patterns to exclude URLs
  • adaptive: Enable adaptive crawling (default: false)
  • output_dir (optional): If provided, persists content to disk and returns metadata only
  • crawler, browser, script, timeout_sec: Same as scrape

Returns (without output_dir):

{
  "start_url": "https://example.com",
  "pages": [
    {
      "url": "https://example.com/page1",
      "markdown": "Content...",
      "links": ["..."]
    }
  ],
  "total_pages": 3
}

Returns (with output_dir):

{
  "run_id": "crawl_20250815_122828_464944",
  "pages_ok": 3,
  "pages_failed": 0,
  "manifest_path": "/path/to/output/manifest.json",
  "bytes_written": 690
}

crawl_site

Comprehensive site crawling with persistence (always requires output_dir).

Arguments:

  • entry_url (required): Starting URL for site crawl
  • output_dir (required): Directory to persist results
  • max_depth: Maximum crawl depth (default: 2, max: 6)
  • max_pages: Maximum pages to crawl (default: 200, max: 5000)
  • Additional config options for filtering and performance

Returns:

{
  "run_id": "site_20250815_122851_0e2455",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json",
  "pages_ok": 15,
  "pages_failed": 2,
  "bytes_written": 45672
}

crawl_sitemap

Sitemap-based crawling with persistence (always requires output_dir).

Arguments:

  • sitemap_url (required): URL to sitemap.xml
  • output_dir (required): Directory to persist results
  • max_entries: Maximum sitemap entries to process (default: 1000)
  • Additional config options for filtering and performance

Returns:

{
  "run_id": "sitemap_20250815_123006_667d71",
  "output_dir": "/path/to/output",
  "manifest_path": "/path/to/output/manifest.json", 
  "pages_ok": 25,
  "pages_failed": 0,
  "bytes_written": 123456
}
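
Both persistence-only tools are invoked the same way as scrape and crawl. A minimal sketch, assuming an initialized ClientSession (as set up in the Usage Examples below) and illustrative output directories:

# Site crawl: always persists results under output_dir
result = await session.call_tool("crawl_site", {
    "entry_url": "https://docs.example.com",
    "output_dir": "./crawls/docs-example",
    "max_depth": 2,
    "max_pages": 50
})

# Sitemap crawl: processes entries listed in sitemap.xml
result = await session.call_tool("crawl_sitemap", {
    "sitemap_url": "https://example.com/sitemap.xml",
    "output_dir": "./crawls/sitemap-example",
    "max_entries": 100
})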

πŸ’‘ Usage Examples

Standalone MCP Server

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="python",
    args=["-m", "crawler_agent.mcp_server"]
)

async with stdio_client(params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        
        # Scrape a single page
        result = await session.call_tool("scrape", {
            "url": "https://example.com"
        })
        
        # Crawl with adaptive stopping
        result = await session.call_tool("crawl", {
            "seed_url": "https://docs.example.com",
            "max_pages": 10,
            "adaptive": True
        })
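
Tool results come back as MCP content blocks rather than raw JSON, so the payloads shown in the Tools Reference arrive as text. A minimal way to decode them, assuming the standard mcp Python SDK result shape (a text content block with a .text field):

import json

# call_tool returns a CallToolResult; the tool's JSON payload arrives
# as a text content block, so decode it back into a dict.
payload = json.loads(result.content[0].text)
print(payload["start_url"], payload["total_pages"])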

OpenAI Agents SDK Integration

from agents import Agent, Runner
from agents.mcp.server import MCPServerStdio

async with MCPServerStdio(
    params={
        "command": "python", 
        "args": ["-m", "crawler_agent.mcp_server"]
    },
    cache_tools_list=True
) as server:
    
    agent = Agent(
        name="Research Assistant",
        instructions="Use scrape and crawl tools to research topics.",
        mcp_servers=[server],
        mcp_config={"convert_schemas_to_strict": True}
    )
    
    result = await Runner.run(
        agent, 
        "Research the latest AI safety papers"
    )
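
The returned run result carries the agent's answer; with the Agents SDK this is exposed as final_output:

# Print the agent's final answer from the RunResult
print(result.final_output)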

Cursor/Claude Code Integration

This project supports both Cursor and Claude Code with synchronized configuration:

Cursor Setup
  • Uses .cursorrules for AI assistant behavior and project guidance
  • Automatic MCP tool detection and integration
  • Project-specific rules and workflow guidance
Claude Code Setup
  • Uses CLAUDE.md for comprehensive project context and memory bank
  • Native MCP integration via mcp__crawl4ai-mcp__* tool calls
  • Global config: ~/.claude/claude_desktop_config.json or project config: .mcp.json
Docker Integration for Both Environments

Both Cursor and Claude Code can use the Dockerized MCP server:

// .mcp.json (project-scoped configuration)
{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}
Dual Environment Features
  • Synchronized Documentation: Changes to documentation are maintained across both environments
  • Shared MCP Configuration: Both use the same MCP server and tool schemas
  • Docker Compatibility: Consistent environment across both Cursor and Claude Code
  • Cross-Compatible: Projects work seamlessly whether using Cursor or Claude Code

🐳 Docker Usage

Quick Start with Docker

The Docker approach eliminates all manual setup and provides a consistent environment:

# 1. Clone and build
git clone https://github.com/uysalsadi/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
./docker-run.sh build

# 2. Test the installation
./docker-run.sh test

# 3. Run MCP server
./docker-run.sh run

Docker Commands Reference

  • ./docker-run.sh build - Build the Docker image
  • ./docker-run.sh run - Run MCP server in stdio mode
  • ./docker-run.sh test - Run smoke tests
  • ./docker-run.sh dev - Development mode with shell access
  • ./docker-run.sh stop - Stop running containers
  • ./docker-run.sh clean - Remove containers and images
  • ./docker-run.sh logs - Show container logs

Docker Compose Services

The docker-compose.yml provides multiple service configurations:

  • crawl4ai-mcp: Production MCP server
  • crawl4ai-mcp-dev: Development container with shell access
  • crawl4ai-mcp-test: Test runner for smoke tests

MCP Integration with Docker

Global MCP Configuration (Claude Code)

Add to ~/.claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i", 
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}

βœ… Copy this exact config - it uses the published Docker image!

Global MCP Configuration (Cursor)

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "/tmp/crawl4ai-crawls:/app/crawls", 
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}

βœ… This configuration is tested and working!

Project-Scoped Configuration

Add to your project's .mcp.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--volume", "./crawls:/app/crawls",
        "uysalsadi/crawl4ai-mcp-server:latest"
      ],
      "env": {
        "CRAWL4AI_MCP_LOG": "INFO"
      }
    }
  }
}

βœ… This project already includes this configuration - see .mcp.json

Configuration Validation

After setting up any MCP configuration:

  1. Test the Docker image works:

    python test-config.py
    
  2. Restart your editor (Cursor/Claude Code) to reload MCP configuration

  3. Verify tools are available:

    • Look for crawl4ai-mcp in the MCP tools panel
    • Should see 4 tools: scrape, crawl, crawl_site, crawl_sitemap

If tools don't appear, check:

  • Docker is running and image is accessible
  • MCP configuration file syntax is valid JSON
  • Editor has been restarted after config changes

Docker Advantages

βœ… Zero Setup: No need for Python venv, pip, or Playwright installation
βœ… Consistent Environment: Same behavior across all platforms
βœ… Isolated Dependencies: No conflicts with your system Python
βœ… Easy Updates: docker pull to get latest version
βœ… Portable: Works anywhere Docker runs
βœ… Volume Persistence: Crawl outputs saved to host filesystem

Environment Variables

Set these when running Docker containers:

# Using docker-compose
OPENAI_API_KEY=your-key docker-compose up crawl4ai-mcp

# Using docker run directly
docker run --rm -i \
  -e OPENAI_API_KEY=your-key \
  -e CRAWL4AI_MCP_LOG=DEBUG \
  -v ./crawls:/app/crawls \
  uysalsadi/crawl4ai-mcp-server:latest

βš™οΈ Configuration

Environment Variables

  • TARGET_URL: Default URL for smoke testing (default: https://modelcontextprotocol.io/docs)
  • RESEARCH_TASK: Custom research task for agents example
  • OPENAI_API_KEY: Required for OpenAI Agents SDK

Safety Settings

The server blocks these URL patterns by default (a sketch of the guard follows the list):

  • localhost, 127.0.0.1, ::1
  • Private IP ranges (RFC 1918)
  • file:// schemes
  • .local, .internal, .lan domains
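
A minimal sketch of this kind of guard (illustrative only; the actual checks live in crawler_agent/safety.py and may differ):

import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_SUFFIXES = (".local", ".internal", ".lan")

def is_url_allowed(url: str) -> bool:
    """Reject non-HTTP schemes, local hostnames, and addresses that
    resolve into private, loopback, or link-local ranges."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # blocks file:// and other non-web schemes
    host = parsed.hostname or ""
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable hosts are rejected, not crawled
    # Covers 127.0.0.1, ::1, and the RFC 1918 private ranges
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)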

πŸš€ Advanced Features

Adaptive Crawling

When adaptive: true is set, the crawler uses a simple content-based stopping strategy:

  • Stops when total content exceeds 5,000 characters (sketched below)
  • Prevents over-crawling for information gathering tasks
  • More sophisticated LLM-based strategies can be added
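
A sketch of that stopping rule (hypothetical names; the real logic lives in crawler_agent/adaptive_strategy.py):

# Illustrative content-budget stop: halt the crawl once the pages
# gathered so far exceed ~5,000 characters of markdown.
MAX_CONTENT_CHARS = 5000

def should_stop(pages: list[dict]) -> bool:
    total = sum(len(page.get("markdown", "")) for page in pages)
    return total >= MAX_CONTENT_CHARS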

Custom Filtering

Use regex patterns to control which URLs are crawled:

await session.call_tool("crawl", {
    "seed_url": "https://docs.example.com",
    "include_patterns": [r"/docs/", r"/api/"],
    "exclude_patterns": [r"/old/", r"\.pdf$"]
})

Browser Configuration

Pass custom browser/crawler settings:

await session.call_tool("scrape", {
    "url": "https://example.com",
    "browser": {"headless": True, "viewport": {"width": 1280, "height": 720}},
    "crawler": {"verbose": True}
})

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   AI Agent      │───▢│   MCP Server     │───▢│   Crawl4AI      β”‚
β”‚ (Cursor/Agents) β”‚    β”‚ (stdio)          β”‚    β”‚ (Playwright)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚  Safety Guards   β”‚
                       β”‚ (URL validation) β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“¦ Publishing to Docker Hub

For maintainers who want to publish updates to the Docker registry:

Publishing Process

# 1. Login to Docker Hub (one time setup)
./docker-push.sh login

# 2. Build, push, and test everything
./docker-push.sh all

# Or do steps individually:
./docker-push.sh build    # Build and tag image
./docker-push.sh push     # Push to Docker Hub
./docker-push.sh test     # Test the published image

Docker Hub Repository

The image is published at: uysalsadi/crawl4ai-mcp-server

  • Latest: uysalsadi/crawl4ai-mcp-server:latest
  • Versioned: uysalsadi/crawl4ai-mcp-server:v1.0.0

Pulling the Published Image

Users can pull and use the image without any local setup:

docker pull uysalsadi/crawl4ai-mcp-server:latest

πŸ”§ Development

Project Structure

crawler_agent/
β”œβ”€β”€ __init__.py
β”œβ”€β”€ mcp_server.py         # Main MCP server
β”œβ”€β”€ safety.py            # URL safety validation
β”œβ”€β”€ adaptive_strategy.py  # Adaptive crawling logic
β”œβ”€β”€ smoke_client.py       # Basic testing client
└── agents_example.py     # OpenAI Agents SDK example

Testing

# Test MCP server
python -m crawler_agent.smoke_client

# Test adaptive crawling
python test_adaptive.py

# Test with agents (requires OPENAI_API_KEY)
python -m crawler_agent.agents_example

Contributing

  1. Environment Setup: Always activate .venv before development work
  2. Documentation: Follow the Documentation Memory Rule - synchronize changes across CLAUDE.md, .cursorrules, and README.md
  3. Code Style: Follow .cursorrules for Cursor or CLAUDE.md for Claude Code
  4. Testing: Use python -m crawler_agent.smoke_client for validation
  5. Safety: Ensure security guards are maintained for all new features
  6. Dual Compatibility: Verify changes work in both Cursor and Claude Code environments

πŸ“„ License

MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

This project uses Crawl4AI by UncleCode for web scraping capabilities. Crawl4AI is an excellent open-source LLM-friendly web crawler and scraper.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

⭐ Support

If this project helped you, please give it a star! It helps others discover the project.

πŸ› Issues

Found a bug or have a feature request? Please open an issue.