MCP Crawl4AI Server

A powerful Model Context Protocol (MCP) server that provides AI agents with advanced web scraping and content extraction capabilities using Crawl4AI and Playwright.

🚀 Features

  • 5 Powerful MCP Tools for comprehensive web scraping
  • AI-Powered Content Extraction with Google Gemini and Anthropic Claude
  • JavaScript Site Support with Playwright browser automation
  • Structured Data Extraction using CSS selectors
  • Batch Processing for multiple URLs
  • Secure API Key Authentication
  • Docker Deployment for easy VPS hosting
  • Swedish Language Support and international content handling

🛠️ Available Tools

1. scrape_url

Basic web content extraction with markdown formatting

Parameters:
- url: Target website URL
- api_key: Authentication key
- wait_for: CSS selector to wait for (optional)
- css_selector: Specific content selector (optional)
- exclude_tags: HTML tags to exclude (optional)
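
For programmatic testing outside an AI client, the tool can be invoked with the official MCP Python SDK. A minimal sketch, assuming a local checkout; the tool name and parameters come from this README, while the path, URL, and selector are placeholders:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio, the same way an MCP client would.
    server = StdioServerParameters(
        command="python",
        args=["/path/to/mcp-crawl4ai/mcp_server.py"],  # placeholder path
        env={"MCP_API_KEY": "test-key-123"},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "scrape_url",
                arguments={
                    "url": "https://example.com",
                    "api_key": "test-key-123",
                    "css_selector": "article",  # optional: narrow to main content
                },
            )
            print(result.content)

asyncio.run(main())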

2. scrape_with_css_extraction

Structured data extraction using CSS selectors

Parameters:
- url: Target website URL  
- extraction_schema: Object mapping fields to CSS selectors
- api_key: Authentication key
- wait_for: Element to wait for (optional)
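
Illustrative arguments for this tool, usable with the same call_tool pattern shown for scrape_url above. The URL and selectors are hypothetical; the schema maps output field names to CSS selectors:

# Hypothetical arguments for scrape_with_css_extraction.
arguments = {
    "url": "https://shop.example/product/42",      # placeholder URL
    "api_key": "test-key-123",
    "extraction_schema": {
        "name": "h1.product-title",                # output field -> CSS selector
        "price": "span.price",
        "description": "div.product-description",
    },
}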

3. scrape_with_llm_extraction

AI-powered intelligent content analysis and extraction

Parameters:
- url: Target website URL
- extraction_prompt: Instructions for AI extraction
- api_key: Authentication key
- model: AI model (gemini-2.5-flash, claude-3-5-sonnet-20241022)
- provider: AI provider (google, anthropic)
- wait_for: Element to wait for (optional)
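
Illustrative arguments combining the model and provider options listed above (the URL and prompt are placeholders):

# Hypothetical arguments for scrape_with_llm_extraction.
arguments = {
    "url": "https://company.com/about",            # placeholder URL
    "api_key": "test-key-123",
    "extraction_prompt": "Extract company name, services, and contact email as JSON.",
    "model": "claude-3-5-sonnet-20241022",
    "provider": "anthropic",
}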

4. scrape_multiple_urls

Process multiple websites in batch (max 10 URLs)

Parameters:
- urls: Array of URLs to process
- api_key: Authentication key
- max_concurrent: Parallel processing limit (default: 3)
- css_selector: Selector for all URLs (optional)
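
Illustrative batch arguments (placeholder URLs); the server caps batches at 10 URLs:

# Hypothetical arguments for scrape_multiple_urls.
arguments = {
    "urls": [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ],
    "api_key": "test-key-123",
    "max_concurrent": 3,        # default parallelism
    "css_selector": "main",     # optional, applied to every URL
}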

5. get_server_status

Check server health, authentication status, and available tools

Parameters:
- api_key: Authentication key (optional)

📦 Quick Start

Option 1: Use Existing VPS Deployment

The server is already deployed and running! Just configure your AI agent:

Claude Desktop Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "crawl4ai-server": {
      "command": "ssh",
      "args": [
        "pallefrej@46.246.38.24",
        "cd ~/mcp-crawl4ai && docker exec -i mcp-crawl4ai-server python mcp_server.py"
      ],
      "env": {
        "MCP_API_KEY": "test-key-123"
      }
    }
  }
}

Option 2: Local Installation

  1. Clone and Setup

git clone <this-repo>
cd mcp-crawl4ai
pip install -r requirements.txt
playwright install chromium

  2. Configure Environment

cp .env.example .env
# Edit .env with your API keys

  3. Run Server

python mcp_server.py

  4. Configure AI Agent

{
  "mcpServers": {
    "crawl4ai-local": {
      "command": "python",
      "args": ["/path/to/mcp-crawl4ai/mcp_server.py"],
      "env": {
        "MCP_API_KEY": "test-key-123"
      }
    }
  }
}

Option 3: Docker Deployment

  1. Build and Run

docker-compose up -d --build

  2. Configure AI Agent

{
  "mcpServers": {
    "crawl4ai-docker": {
      "command": "docker",
      "args": ["exec", "-i", "mcp-crawl4ai-server", "python", "mcp_server.py"],
      "env": {
        "MCP_API_KEY": "test-key-123"
      }
    }
  }
}

🔐 Authentication

Default API Keys

  • test-key-123 - Development and testing
  • production-key-456 - Production usage

These defaults ship for convenience; replace them with your own keys before exposing the server.

Custom API Keys

Add your own keys to .env:

MCP_API_KEYS=key1,key2,key3
GOOGLE_API_KEY=your-google-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
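
The README does not show the server's validation code, but the comma-separated MCP_API_KEYS format implies a simple allow-list check along these lines (a sketch only; the names are assumptions):

import os

# Parse the comma-separated allow-list once at startup (assumed approach).
ALLOWED_KEYS = {
    key.strip()
    for key in os.environ.get("MCP_API_KEYS", "").split(",")
    if key.strip()
}

def is_authorized(api_key: str | None) -> bool:
    """Return True if the supplied key appears in MCP_API_KEYS."""
    return api_key is not None and api_key in ALLOWED_KEYS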

💡 Usage Examples

Basic Web Scraping

"Scrape the content from https://example.com and show me the main article"

Structured Data Extraction

"Extract all product names, prices, and descriptions from this e-commerce page"

AI-Powered Analysis

"Analyze this Swedish business website and extract: company info, services offered, contact details, and key benefits in Swedish"

Competitive Research

"Scrape these 5 competitor websites and compare their pricing models"

Content Monitoring

"Extract the latest news headlines from this news site and summarize the top 3 stories"

🌍 Supported Sites

  • Static HTML Sites - Standard websites
  • JavaScript/React Sites - GitHub, Reddit, modern SPAs
  • Swedish Content - Euromaster.se, Swedish business sites
  • E-commerce Sites - Product catalogs, pricing pages
  • News Sites - Article extraction, headline monitoring
  • Business Sites - Company info, service descriptions

🎯 AI Integration

Google Gemini Models

  • gemini-2.5-flash - Fast, cost-effective extraction
  • Best for: Quick summaries, basic data extraction

Anthropic Claude Models

  • claude-3-5-sonnet-20241022 - Advanced reasoning and analysis
  • Best for: Complex analysis, structured data, nuanced content

Example AI Extraction

Tool: scrape_with_llm_extraction
URL: https://company.com/about
Prompt: "Extract company data in JSON format: {name, founded, employees, services, contact_email}"
Model: claude-3-5-sonnet-20241022

🐳 Docker Configuration

Environment Variables

# Authentication
MCP_API_KEYS=test-key-123,production-key-456

# AI Integration  
GOOGLE_API_KEY=your-google-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key

# Crawl4AI Settings
CRAWL4AI_CACHE_DIR=/tmp/crawl4ai_cache
CRAWL4AI_LOG_LEVEL=INFO

Volume Mounts

  • crawl4ai_cache:/tmp/crawl4ai_cache - Browser cache for performance
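
For reference, a minimal docker-compose.yml consistent with the commands and volume above might look like this (a sketch; the actual file ships with the repo):

version: "3.8"
services:
  mcp-crawl4ai-server:
    build: .
    container_name: mcp-crawl4ai-server
    env_file: .env                            # MCP_API_KEYS, GOOGLE_API_KEY, etc.
    volumes:
      - crawl4ai_cache:/tmp/crawl4ai_cache    # browser cache for performance
volumes:
  crawl4ai_cache: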

Container Management

# Start server
docker-compose up -d

# View logs
docker logs mcp-crawl4ai-server --follow

# Restart server  
docker-compose restart

# Update and rebuild
docker-compose down && docker-compose up -d --build

📊 Performance & Limits

  • Max URLs per batch: 10 URLs
  • Default concurrency: 3 parallel requests
  • Browser pooling: Shared Chromium instance for efficiency
  • Cache system: Persistent storage for improved performance
  • Memory usage: ~200-500MB depending on content complexity

🔧 Troubleshooting

Server Not Responding

# Check container status
docker-compose ps

# View detailed logs
docker logs mcp-crawl4ai-server --tail 50

# Restart container
docker-compose restart

Authentication Issues

  • Verify MCP_API_KEY in your AI agent configuration
  • Check server logs for authentication attempts
  • Ensure API key exists in MCP_API_KEYS environment variable

Scraping Failures

  • Some sites may block automated requests
  • Try the wait_for parameter for dynamic content (see the sketch below)
  • Check whether the site requires specific user agents or headers
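
For example, a dynamic page can often be scraped by waiting for a selector that only appears after client-side rendering (hypothetical URL and selector):

# Hypothetical arguments: wait for JS-rendered content before extracting.
arguments = {
    "url": "https://example.com/spa",
    "api_key": "test-key-123",
    "wait_for": "div#content-loaded",   # appears only after scripts run
}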

📁 Project Structure

mcp-crawl4ai/
├── mcp_server.py           # Main MCP server
├── requirements.txt        # Python dependencies
├── .env                    # Environment configuration
├── Dockerfile              # Container definition
├── docker-compose.yml      # Container orchestration
├── README.md               # This file
├── MCP_CONFIGURATION.md    # Detailed configuration guide
└── examples/               # Usage examples

🔄 Updates & Maintenance

Updating the Server

  1. Update code files
  2. Rebuild container: docker-compose up -d --build
  3. Test functionality with status check

Monitoring

  • Container logs: docker logs mcp-crawl4ai-server
  • Server status: Use get_server_status tool
  • Resource usage: docker stats mcp-crawl4ai-server

⚠️ Important Notes

  • Respect robots.txt and website terms of service
  • Rate limiting is built-in but be mindful of target sites
  • JavaScript execution requires resources - monitor container memory
  • API keys for LLM features are optional but enhance functionality
  • Network access required for both scraping and AI API calls

📞 Support

Configuration Help

See MCP_CONFIGURATION.md for detailed setup instructions.

Common Issues

  1. MCP Server not detected: Check JSON syntax and restart AI client
  2. Permission denied: Verify SSH access and API keys
  3. Docker issues: Ensure Docker daemon is running

Debug Mode

Enable detailed logging:

CRAWL4AI_LOG_LEVEL=DEBUG

🎊 Success Indicators

When properly configured, you should see:

  • 🟢 MCP server indicator in your AI client
  • 5 tools available when asking "What tools do you have?"
  • 🤖 Successful scraping of test websites
  • 🔐 Authentication working with your API keys

📄 License

This project is designed for educational and development purposes. Ensure compliance with target websites' terms of service and applicable laws when scraping content.


Ready to supercharge your AI agent with web scraping capabilities? Start with the Quick Start guide above!