MCP Crawl4AI Server

A powerful Model Context Protocol (MCP) server that provides AI agents with advanced web scraping and content extraction capabilities using Crawl4AI and Playwright.

🚀 Features

  • 5 Powerful MCP Tools for comprehensive web scraping
  • AI-Powered Content Extraction with Google Gemini and Anthropic Claude
  • JavaScript Site Support with Playwright browser automation
  • Structured Data Extraction using CSS selectors
  • Batch Processing for multiple URLs
  • Secure API Key Authentication
  • Docker Deployment for easy VPS hosting
  • Swedish Language Support and international content handling

🛠️ Available Tools

1. scrape_url

Basic web content extraction with markdown formatting

Parameters:
- url: Target website URL
- api_key: Authentication key
- wait_for: CSS selector to wait for (optional)
- css_selector: Specific content selector (optional)
- exclude_tags: HTML tags to exclude (optional)
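
For programmatic testing outside an AI client, the tool can be invoked with the official MCP Python SDK. A minimal sketch, assuming a local checkout; the tool name and parameters come from this README, while the path, URL, and selector are placeholders:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio, the same way an MCP client would.
    server = StdioServerParameters(
        command="python",
        args=["/path/to/mcp-crawl4ai/mcp_server.py"],  # placeholder path
        env={"MCP_API_KEY": "test-key-123"},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "scrape_url",
                arguments={
                    "url": "https://example.com",
                    "api_key": "test-key-123",
                    "css_selector": "article",  # optional: narrow to main content
                },
            )
            print(result.content)

asyncio.run(main())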

2. scrape_with_css_extraction

Structured data extraction using CSS selectors

Parameters:
- url: Target website URL  
- extraction_schema: Object mapping fields to CSS selectors
- api_key: Authentication key
- wait_for: Element to wait for (optional)
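
Illustrative arguments for this tool, usable with the same call_tool pattern shown for scrape_url above. The URL and selectors are hypothetical; the schema maps output field names to CSS selectors:

# Hypothetical arguments for scrape_with_css_extraction.
arguments = {
    "url": "https://shop.example/product/42",      # placeholder URL
    "api_key": "test-key-123",
    "extraction_schema": {
        "name": "h1.product-title",                # output field -> CSS selector
        "price": "span.price",
        "description": "div.product-description",
    },
}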

3. scrape_with_llm_extraction

AI-powered intelligent content analysis and extraction

Parameters:
- url: Target website URL
- extraction_prompt: Instructions for AI extraction
- api_key: Authentication key
- model: AI model (gemini-2.5-flash, claude-3-5-sonnet-20241022)
- provider: AI provider (google, anthropic)
- wait_for: Element to wait for (optional)
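
Illustrative arguments combining the model and provider options listed above (the URL and prompt are placeholders):

# Hypothetical arguments for scrape_with_llm_extraction.
arguments = {
    "url": "https://company.com/about",            # placeholder URL
    "api_key": "test-key-123",
    "extraction_prompt": "Extract company name, services, and contact email as JSON.",
    "model": "claude-3-5-sonnet-20241022",
    "provider": "anthropic",
}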

4. scrape_multiple_urls

Process multiple websites in batch (max 10 URLs)

Parameters:
- urls: Array of URLs to process
- api_key: Authentication key
- max_concurrent: Parallel processing limit (default: 3)
- css_selector: Selector for all URLs (optional)
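
Illustrative batch arguments (placeholder URLs); the server caps batches at 10 URLs:

# Hypothetical arguments for scrape_multiple_urls.
arguments = {
    "urls": [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ],
    "api_key": "test-key-123",
    "max_concurrent": 3,        # default parallelism
    "css_selector": "main",     # optional, applied to every URL
}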

5. get_server_status

Check server health, authentication status, and available tools

Parameters:
- api_key: Authentication key (optional)

📦 Quick Start

Option 1: Use Existing VPS Deployment

The server is already deployed and running! Just configure your AI agent:

Claude Desktop Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "crawl4ai-server": {
      "command": "ssh",
      "args": [
        "pallefrej@46.246.38.24",
        "cd ~/mcp-crawl4ai && docker exec -i mcp-crawl4ai-server python mcp_server.py"
      ],
      "env": {
        "MCP_API_KEY": "test-key-123"
      }
    }
  }
}

Option 2: Local Installation

  1. Clone and Setup

git clone <this-repo>
cd mcp-crawl4ai
pip install -r requirements.txt
playwright install chromium

  2. Configure Environment

cp .env.example .env
# Edit .env with your API keys

  3. Run Server

python mcp_server.py

  4. Configure AI Agent

{
  "mcpServers": {
    "crawl4ai-local": {
      "command": "python",
      "args": ["/path/to/mcp-crawl4ai/mcp_server.py"],
      "env": {
        "MCP_API_KEY": "test-key-123"
      }
    }
  }
}

Option 3: Docker Deployment

  1. Build and Run

docker-compose up -d --build

  2. Configure AI Agent

{
  "mcpServers": {
    "crawl4ai-docker": {
      "command": "docker",
      "args": ["exec", "-i", "mcp-crawl4ai-server", "python", "mcp_server.py"],
      "env": {
        "MCP_API_KEY": "test-key-123"
      }
    }
  }
}

🔐 Authentication

Default API Keys

  • test-key-123 - Development and testing
  • production-key-456 - Production usage

These defaults ship for convenience; replace them with your own keys before exposing the server.

Custom API Keys

Add your own keys to .env:

MCP_API_KEYS=key1,key2,key3
GOOGLE_API_KEY=your-google-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
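
The README does not show the server's validation code, but the comma-separated MCP_API_KEYS format implies a simple allow-list check along these lines (a sketch only; the names are assumptions):

import os

# Parse the comma-separated allow-list once at startup (assumed approach).
ALLOWED_KEYS = {
    key.strip()
    for key in os.environ.get("MCP_API_KEYS", "").split(",")
    if key.strip()
}

def is_authorized(api_key: str | None) -> bool:
    """Return True if the supplied key appears in MCP_API_KEYS."""
    return api_key is not None and api_key in ALLOWED_KEYS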

💡 Usage Examples

Basic Web Scraping

"Scrape the content from https://example.com and show me the main article"

Structured Data Extraction

"Extract all product names, prices, and descriptions from this e-commerce page"

AI-Powered Analysis

"Analyze this Swedish business website and extract: company info, services offered, contact details, and key benefits in Swedish"

Competitive Research

"Scrape these 5 competitor websites and compare their pricing models"

Content Monitoring

"Extract the latest news headlines from this news site and summarize the top 3 stories"

🌍 Supported Sites

  • Static HTML Sites - Standard websites
  • JavaScript/React Sites - GitHub, Reddit, modern SPAs
  • Swedish Content - Euromaster.se, Swedish business sites
  • E-commerce Sites - Product catalogs, pricing pages
  • News Sites - Article extraction, headline monitoring
  • Business Sites - Company info, service descriptions

🎯 AI Integration

Google Gemini Models

  • gemini-2.5-flash - Fast, cost-effective extraction
  • Best for: Quick summaries, basic data extraction

Anthropic Claude Models

  • claude-3-5-sonnet-20241022 - Advanced reasoning and analysis
  • Best for: Complex analysis, structured data, nuanced content

Example AI Extraction

Tool: scrape_with_llm_extraction
URL: https://company.com/about
Prompt: "Extract company data in JSON format: {name, founded, employees, services, contact_email}"
Model: claude-3-5-sonnet-20241022

🐳 Docker Configuration

Environment Variables

# Authentication
MCP_API_KEYS=test-key-123,production-key-456

# AI Integration  
GOOGLE_API_KEY=your-google-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key

# Crawl4AI Settings
CRAWL4AI_CACHE_DIR=/tmp/crawl4ai_cache
CRAWL4AI_LOG_LEVEL=INFO

Volume Mounts

  • crawl4ai_cache:/tmp/crawl4ai_cache - Browser cache for performance
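
For reference, a minimal docker-compose.yml consistent with the commands and volume above might look like this (a sketch; the actual file ships with the repo):

version: "3.8"
services:
  mcp-crawl4ai-server:
    build: .
    container_name: mcp-crawl4ai-server
    env_file: .env                            # MCP_API_KEYS, GOOGLE_API_KEY, etc.
    volumes:
      - crawl4ai_cache:/tmp/crawl4ai_cache    # browser cache for performance
volumes:
  crawl4ai_cache: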

Container Management

# Start server
docker-compose up -d

# View logs
docker logs mcp-crawl4ai-server --follow

# Restart server  
docker-compose restart

# Update and rebuild
docker-compose down && docker-compose up -d --build

📊 Performance & Limits

  • Max URLs per batch: 10 URLs
  • Default concurrency: 3 parallel requests
  • Browser pooling: Shared Chromium instance for efficiency
  • Cache system: Persistent storage for improved performance
  • Memory usage: ~200-500MB depending on content complexity

🔧 Troubleshooting

Server Not Responding

# Check container status
docker-compose ps

# View detailed logs
docker logs mcp-crawl4ai-server --tail 50

# Restart container
docker-compose restart

Authentication Issues

  • Verify MCP_API_KEY in your AI agent configuration
  • Check server logs for authentication attempts
  • Ensure API key exists in MCP_API_KEYS environment variable

Scraping Failures

  • Some sites may block automated requests
  • Try the wait_for parameter for dynamic content (see the sketch below)
  • Check whether the site requires specific user agents or headers
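
For example, a dynamic page can often be scraped by waiting for a selector that only appears after client-side rendering (hypothetical URL and selector):

# Hypothetical arguments: wait for JS-rendered content before extracting.
arguments = {
    "url": "https://example.com/spa",
    "api_key": "test-key-123",
    "wait_for": "div#content-loaded",   # appears only after scripts run
}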

📁 Project Structure

mcp-crawl4ai/
├── mcp_server.py           # Main MCP server
├── requirements.txt        # Python dependencies
├── .env                    # Environment configuration
├── Dockerfile              # Container definition
├── docker-compose.yml      # Container orchestration
├── README.md               # This file
├── MCP_CONFIGURATION.md    # Detailed configuration guide
└── examples/               # Usage examples

🔄 Updates & Maintenance

Updating the Server

  1. Update code files
  2. Rebuild container: docker-compose up -d --build
  3. Test functionality with status check

Monitoring

  • Container logs: docker logs mcp-crawl4ai-server
  • Server status: Use get_server_status tool
  • Resource usage: docker stats mcp-crawl4ai-server

⚠️ Important Notes

  • Respect robots.txt and website terms of service
  • Rate limiting is built-in but be mindful of target sites
  • JavaScript execution requires resources - monitor container memory
  • API keys for LLM features are optional but enhance functionality
  • Network access required for both scraping and AI API calls

📞 Support

Configuration Help

See MCP_CONFIGURATION.md for detailed setup instructions.

Common Issues

  1. MCP Server not detected: Check JSON syntax and restart AI client
  2. Permission denied: Verify SSH access and API keys
  3. Docker issues: Ensure Docker daemon is running

Debug Mode

Enable detailed logging:

CRAWL4AI_LOG_LEVEL=DEBUG

🎊 Success Indicators

When properly configured, you should see:

  • 🟢 MCP server indicator in your AI client
  • 5 tools available when asking "What tools do you have?"
  • 🤖 Successful scraping of test websites
  • 🔐 Authentication working with your API keys

📄 License

This project is designed for educational and development purposes. Ensure compliance with target websites' terms of service and applicable laws when scraping content.


Ready to supercharge your AI agent with web scraping capabilities? Start with the Quick Start guide above!