
FreeCrawl MCP Server

A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.

🚀 Features

  • JavaScript-enabled web scraping with Playwright and anti-detection measures
  • Document processing with fallback support for various formats
  • Concurrent batch processing with configurable limits
  • Intelligent caching with SQLite backend
  • Rate limiting per domain
  • Comprehensive error handling with retry logic
  • Easy installation via uvx or local development setup
  • Health monitoring and metrics collection

📦 Installation & Usage

Quick Start with uvx (Recommended)

The easiest way to use FreeCrawl is with uvx, which automatically manages dependencies:

# Install browsers on first run
uvx freecrawl-mcp --install-browsers

# Test functionality
uvx freecrawl-mcp --test
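
Once the browsers are installed, running the package with no extra flags starts the server itself; an MCP client such as Claude Code normally launches this command for you (see the MCP configuration below):

# Start the MCP server
uvx freecrawl-mcp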

Local Development Setup

For local development or customization:

  1. Clone from GitHub:

    git clone https://github.com/dylan-gluck/freecrawl-mcp.git
    cd freecrawl-mcp
    
  2. Set up environment:

    # Sync dependencies
    uv sync
    
    # Install browser dependencies
    uv run freecrawl-mcp --install-browsers
    
    # Run tests
    uv run freecrawl-mcp --test
    
  3. Run the server:

    uv run freecrawl-mcp
    

🛠 Configuration

Configure FreeCrawl using environment variables:

Basic Configuration

# Transport (stdio for MCP, http for REST API)
export FREECRAWL_TRANSPORT=stdio

# Browser pool settings
export FREECRAWL_MAX_BROWSERS=3
export FREECRAWL_HEADLESS=true

# Concurrency limits
export FREECRAWL_MAX_CONCURRENT=10
export FREECRAWL_MAX_PER_DOMAIN=3

# Cache settings
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache
export FREECRAWL_CACHE_TTL=3600
export FREECRAWL_CACHE_SIZE=536870912  # 512MB

# Rate limiting
export FREECRAWL_RATE_LIMIT=60  # requests per minute

# Logging
export FREECRAWL_LOG_LEVEL=INFO

Security Settings

# API authentication (optional)
export FREECRAWL_REQUIRE_API_KEY=false
export FREECRAWL_API_KEYS=key1,key2,key3

# Domain blocking
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1

# Anti-detection
export FREECRAWL_ANTI_DETECT=true
export FREECRAWL_ROTATE_UA=true

🔧 MCP Tools

FreeCrawl provides the following MCP tools:

freecrawl_scrape

Scrape content from a single URL with advanced options.

Parameters:

  • url (string): URL to scrape
  • formats (array): Output formats - ["markdown", "html", "text", "screenshot", "structured"]
  • javascript (boolean): Enable JavaScript execution (default: true)
  • wait_for (string, optional): CSS selector to wait for, or a delay in milliseconds
  • anti_bot (boolean): Enable anti-detection measures (default: true)
  • headers (object, optional): Custom HTTP headers
  • cookies (object, optional): Custom cookies
  • cache (boolean): Use cached results if available (default: true)
  • timeout (number): Total timeout in milliseconds (default: 30000)

Example:

{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown", "screenshot"],
    "javascript": true,
    "wait_for": "2000"
  }
}
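
The wait_for parameter also accepts a CSS selector, so the same call can wait for a specific element to appear instead of a fixed delay (the selector below is illustrative):

{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown"],
    "wait_for": "#main-content"
  }
}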

freecrawl_batch_scrape

Scrape multiple URLs concurrently.

Parameters:

  • urls (array): List of URLs to scrape (max 100)
  • concurrency (number): Maximum concurrent requests (default: 5)
  • formats (array): Output formats (default: ["markdown"])
  • common_options (object, optional): Options applied to all URLs
  • continue_on_error (boolean): Continue if individual URLs fail (default: true)

Example:

{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "concurrency": 3,
    "formats": ["markdown", "text"]
  }
}
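
Shared per-URL settings go in common_options; assuming it accepts the same fields as freecrawl_scrape, a batch call might look like this:

{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "common_options": {
      "javascript": true,
      "cache": true,
      "timeout": 15000
    },
    "continue_on_error": true
  }
}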

freecrawl_extract

Extract structured data using a schema-driven approach.

Parameters:

  • url (string): URL to extract data from
  • schema (object): JSON Schema or Pydantic model definition
  • prompt (string, optional): Custom extraction instructions
  • validation (boolean): Validate against schema (default: true)
  • multiple (boolean): Extract multiple matching items (default: false)

Example:

{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/product",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"}
      }
    }
  }
}
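
For listing pages, the documented multiple and prompt parameters can be combined to pull every matching item (the URL, schema, and prompt below are illustrative):

{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/catalog",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"}
      }
    },
    "multiple": true,
    "prompt": "Extract every product listed on the page"
  }
}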

freecrawl_process_document

Process documents (PDF, DOCX, etc.) with OCR support.

Parameters:

  • file_path (string, optional): Path to document file
  • url (string, optional): URL to download document from
  • strategy (string): Processing strategy - "fast", "hi_res", "ocr_only" (default: "hi_res")
  • formats (array): Output formats - ["markdown", "structured", "text"]
  • languages (array, optional): OCR languages (e.g., ["eng", "fra"])
  • extract_images (boolean): Extract embedded images (default: false)
  • extract_tables (boolean): Extract and structure tables (default: true)

Example:

{
  "name": "freecrawl_process_document",
  "arguments": {
    "url": "https://example.com/document.pdf",
    "strategy": "hi_res",
    "formats": ["markdown", "structured"]
  }
}
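
For a local scanned document, the same tool can be pointed at a file path and forced to OCR with explicit languages (the path below is illustrative):

{
  "name": "freecrawl_process_document",
  "arguments": {
    "file_path": "/path/to/scanned-report.pdf",
    "strategy": "ocr_only",
    "languages": ["eng", "fra"],
    "formats": ["text"]
  }
}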

freecrawl_health_check

Get server health status and metrics.

Example:

{
  "name": "freecrawl_health_check",
  "arguments": {}
}

🔄 Integration with Claude Code

MCP Configuration

Add FreeCrawl to your MCP configuration:

Using uvx (Recommended):

{
  "mcpServers": {
    "freecrawl": {
      "command": "uvx",
      "args": ["freecrawl-mcp"]
    }
  }
}

Using local development setup:

{
  "mcpServers": {
    "freecrawl": {
      "command": "uv",
      "args": ["run", "freecrawl-mcp"],
      "cwd": "/path/to/freecrawl-mcp"
    }
  }
}

Usage in Prompts

Please scrape the content from https://example.com and extract the main article text in markdown format.

Claude Code will automatically use the freecrawl_scrape tool to fetch and process the content.
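
Under the hood, a prompt like the one above is typically translated into a single tool call along these lines (the arguments are illustrative):

{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown"]
  }
}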

🚀 Performance & Scalability

Resource Usage

  • Memory: ~100MB base + ~50MB per browser instance (see the worked example below)
  • CPU: Moderate usage during active scraping
  • Storage: Cache grows based on configured limits
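
For example, with FREECRAWL_MAX_BROWSERS=3 (as in the configuration example above), the browser pool alone accounts for roughly 3 × 50MB = 150MB, or about 250MB total including the base process.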

Throughput

  • Single requests: 2-5 seconds typical response time
  • Batch processing: 10-50 concurrent requests depending on configuration
  • Cache hit ratio: 30%+ for repeated content

Optimization Tips

  1. Enable caching for frequently accessed content
  2. Adjust concurrency based on target site rate limits
  3. Use appropriate formats - markdown is faster than screenshots
  4. Configure rate limiting to avoid being blocked
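
Putting these tips together, a conservative profile for repeatedly scraping the same handful of sites might look like the following; the values are illustrative and use only the environment variables documented above:

# Cache aggressively and stay well under target-site rate limits
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_TTL=86400   # keep results for a day
export FREECRAWL_MAX_CONCURRENT=5
export FREECRAWL_MAX_PER_DOMAIN=2
export FREECRAWL_RATE_LIMIT=30     # requests per minute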

🛡 Security Considerations

Anti-Detection

  • Rotating user agents
  • Realistic browser fingerprints
  • Request timing randomization
  • JavaScript execution in sandboxed environment

Input Validation

  • URL format validation
  • Private IP blocking
  • Domain blocklist support
  • Request size limits

Resource Protection

  • Memory usage monitoring
  • Browser pool size limits
  • Request timeout enforcement
  • Rate limiting per domain

🔧 Troubleshooting

Common Issues

Issue | Possible Cause | Solution
High memory usage | Too many browser instances | Reduce FREECRAWL_MAX_BROWSERS
Slow responses | JavaScript-heavy sites | Increase the timeout or disable JavaScript
Bot detection | Missing anti-detection | Ensure FREECRAWL_ANTI_DETECT=true
Cache misses | TTL too short | Increase FREECRAWL_CACHE_TTL
Import errors | Missing dependencies | Run uvx freecrawl-mcp --test

Debug Mode

With uvx:

export FREECRAWL_LOG_LEVEL=DEBUG
uvx freecrawl-mcp --test

Local development:

export FREECRAWL_LOG_LEVEL=DEBUG
uv run freecrawl-mcp --test

📈 Monitoring & Observability

Health Metrics

  • Browser pool status
  • Memory and CPU usage
  • Cache hit rates
  • Request success rates
  • Response times

Logging

FreeCrawl provides structured logging with configurable levels:

  • ERROR: Critical failures
  • WARNING: Recoverable issues
  • INFO: General operations
  • DEBUG: Detailed troubleshooting

🔧 Development

Running Tests

With uvx:

# Basic functionality test
uvx freecrawl-mcp --test

Local development:

# Basic functionality test
uv run freecrawl-mcp --test

Code Structure

  • Core server: FreeCrawlServer class
  • Browser management: BrowserPool for resource pooling
  • Content extraction: ContentExtractor with multiple strategies
  • Caching: CacheManager with SQLite backend
  • Rate limiting: RateLimiter with token bucket algorithm

📄 License

This project is licensed under the MIT License - see the technical specification for details.

🤝 Contributing

  1. Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
  2. Create a feature branch
  3. Set up local development: uv sync
  4. Run tests: uv run freecrawl-mcp --test
  5. Submit a pull request

📚 Technical Specification

For detailed technical information, see ai_docs/FREECRAWL_TECHNICAL_SPEC.md.


FreeCrawl MCP Server - Self-hosted web scraping for the modern web 🚀