
FreeCrawl MCP Server

A production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.

🚀 Features

  • JavaScript-enabled web scraping with Playwright and anti-detection measures
  • Document processing with fallback support for various formats
  • Concurrent batch processing with configurable limits
  • Intelligent caching with SQLite backend
  • Rate limiting per domain
  • Comprehensive error handling with retry logic
  • Easy installation via uvx or local development setup
  • Health monitoring and metrics collection

📦 Installation & Usage

Quick Start with uvx (Recommended)

The easiest way to use FreeCrawl is with uvx, which automatically manages dependencies:

# Install browsers on first run
uvx freecrawl-mcp --install-browsers

# Test functionality
uvx freecrawl-mcp --test
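
Once the browsers are installed, running the package with no extra flags starts the server itself; an MCP client such as Claude Code normally launches this command for you (see the MCP configuration below):

# Start the MCP server
uvx freecrawl-mcp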

Local Development Setup

For local development or customization:

  1. Clone from GitHub:

    git clone https://github.com/dylan-gluck/freecrawl-mcp.git
    cd freecrawl-mcp
    
  2. Set up environment:

    # Sync dependencies
    uv sync
    
    # Install browser dependencies
    uv run freecrawl-mcp --install-browsers
    
    # Run tests
    uv run freecrawl-mcp --test
    
  3. Run the server:

    uv run freecrawl-mcp
    

🛠 Configuration

Configure FreeCrawl using environment variables:

Basic Configuration

# Transport (stdio for MCP, http for REST API)
export FREECRAWL_TRANSPORT=stdio

# Browser pool settings
export FREECRAWL_MAX_BROWSERS=3
export FREECRAWL_HEADLESS=true

# Concurrency limits
export FREECRAWL_MAX_CONCURRENT=10
export FREECRAWL_MAX_PER_DOMAIN=3

# Cache settings
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache
export FREECRAWL_CACHE_TTL=3600
export FREECRAWL_CACHE_SIZE=536870912  # 512MB

# Rate limiting
export FREECRAWL_RATE_LIMIT=60  # requests per minute

# Logging
export FREECRAWL_LOG_LEVEL=INFO

Security Settings

# API authentication (optional)
export FREECRAWL_REQUIRE_API_KEY=false
export FREECRAWL_API_KEYS=key1,key2,key3

# Domain blocking
export FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1

# Anti-detection
export FREECRAWL_ANTI_DETECT=true
export FREECRAWL_ROTATE_UA=true

🔧 MCP Tools

FreeCrawl provides the following MCP tools:

freecrawl_scrape

Scrape content from a single URL with advanced options.

Parameters:

  • url (string): URL to scrape
  • formats (array): Output formats - ["markdown", "html", "text", "screenshot", "structured"]
  • javascript (boolean): Enable JavaScript execution (default: true)
  • wait_for (string, optional): CSS selector to wait for, or a delay in milliseconds
  • anti_bot (boolean): Enable anti-detection measures (default: true)
  • headers (object, optional): Custom HTTP headers
  • cookies (object, optional): Custom cookies
  • cache (boolean): Use cached results if available (default: true)
  • timeout (number): Total timeout in milliseconds (default: 30000)

Example:

{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown", "screenshot"],
    "javascript": true,
    "wait_for": "2000"
  }
}
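
The wait_for parameter also accepts a CSS selector, so the same call can wait for a specific element to appear instead of a fixed delay (the selector below is illustrative):

{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown"],
    "wait_for": "#main-content"
  }
}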

freecrawl_batch_scrape

Scrape multiple URLs concurrently.

Parameters:

  • urls (array): List of URLs to scrape (max 100)
  • concurrency (number): Maximum concurrent requests (default: 5)
  • formats (array): Output formats (default: ["markdown"])
  • common_options (object, optional): Options applied to all URLs
  • continue_on_error (boolean): Continue if individual URLs fail (default: true)

Example:

{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "concurrency": 3,
    "formats": ["markdown", "text"]
  }
}
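
Shared per-URL settings go in common_options; assuming it accepts the same fields as freecrawl_scrape, a batch call might look like this:

{
  "name": "freecrawl_batch_scrape",
  "arguments": {
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "common_options": {
      "javascript": true,
      "cache": true,
      "timeout": 15000
    },
    "continue_on_error": true
  }
}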

freecrawl_extract

Extract structured data using a schema-driven approach.

Parameters:

  • url (string): URL to extract data from
  • schema (object): JSON Schema or Pydantic model definition
  • prompt (string, optional): Custom extraction instructions
  • validation (boolean): Validate against schema (default: true)
  • multiple (boolean): Extract multiple matching items (default: false)

Example:

{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/product",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"}
      }
    }
  }
}
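
For listing pages, the documented multiple and prompt parameters can be combined to pull every matching item (the URL, schema, and prompt below are illustrative):

{
  "name": "freecrawl_extract",
  "arguments": {
    "url": "https://example.com/catalog",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"}
      }
    },
    "multiple": true,
    "prompt": "Extract every product listed on the page"
  }
}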

freecrawl_process_document

Process documents (PDF, DOCX, etc.) with OCR support.

Parameters:

  • file_path (string, optional): Path to document file
  • url (string, optional): URL to download document from
  • strategy (string): Processing strategy - "fast", "hi_res", "ocr_only" (default: "hi_res")
  • formats (array): Output formats - ["markdown", "structured", "text"]
  • languages (array, optional): OCR languages (e.g., ["eng", "fra"])
  • extract_images (boolean): Extract embedded images (default: false)
  • extract_tables (boolean): Extract and structure tables (default: true)

Example:

{
  "name": "freecrawl_process_document",
  "arguments": {
    "url": "https://example.com/document.pdf",
    "strategy": "hi_res",
    "formats": ["markdown", "structured"]
  }
}
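
For a local scanned document, the same tool can be pointed at a file path and forced to OCR with explicit languages (the path below is illustrative):

{
  "name": "freecrawl_process_document",
  "arguments": {
    "file_path": "/path/to/scanned-report.pdf",
    "strategy": "ocr_only",
    "languages": ["eng", "fra"],
    "formats": ["text"]
  }
}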

freecrawl_health_check

Get server health status and metrics.

Example:

{
  "name": "freecrawl_health_check",
  "arguments": {}
}

🔄 Integration with Claude Code

MCP Configuration

Add FreeCrawl to your MCP configuration:

Using uvx (Recommended):

{
  "mcpServers": {
    "freecrawl": {
      "command": "uvx",
      "args": ["freecrawl-mcp"]
    }
  }
}

Using local development setup:

{
  "mcpServers": {
    "freecrawl": {
      "command": "uv",
      "args": ["run", "freecrawl-mcp"],
      "cwd": "/path/to/freecrawl-mcp"
    }
  }
}

Usage in Prompts

Please scrape the content from https://example.com and extract the main article text in markdown format.

Claude Code will automatically use the freecrawl_scrape tool to fetch and process the content.
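
Under the hood, a prompt like the one above is typically translated into a single tool call along these lines (the arguments are illustrative):

{
  "name": "freecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown"]
  }
}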

🚀 Performance & Scalability

Resource Usage

  • Memory: ~100MB base + ~50MB per browser instance (see the worked example below)
  • CPU: Moderate usage during active scraping
  • Storage: Cache grows based on configured limits
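
For example, with FREECRAWL_MAX_BROWSERS=3 (as in the configuration example above), the browser pool alone accounts for roughly 3 × 50MB = 150MB, or about 250MB total including the base process.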

Throughput

  • Single requests: 2-5 seconds typical response time
  • Batch processing: 10-50 concurrent requests depending on configuration
  • Cache hit ratio: 30%+ for repeated content

Optimization Tips

  1. Enable caching for frequently accessed content
  2. Adjust concurrency based on target site rate limits
  3. Use appropriate formats - markdown is faster than screenshots
  4. Configure rate limiting to avoid being blocked
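
Putting these tips together, a conservative profile for repeatedly scraping the same handful of sites might look like the following; the values are illustrative and use only the environment variables documented above:

# Cache aggressively and stay well under target-site rate limits
export FREECRAWL_CACHE=true
export FREECRAWL_CACHE_TTL=86400   # keep results for a day
export FREECRAWL_MAX_CONCURRENT=5
export FREECRAWL_MAX_PER_DOMAIN=2
export FREECRAWL_RATE_LIMIT=30     # requests per minute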

🛡 Security Considerations

Anti-Detection

  • Rotating user agents
  • Realistic browser fingerprints
  • Request timing randomization
  • JavaScript execution in sandboxed environment

Input Validation

  • URL format validation
  • Private IP blocking
  • Domain blocklist support
  • Request size limits

Resource Protection

  • Memory usage monitoring
  • Browser pool size limits
  • Request timeout enforcement
  • Rate limiting per domain

🔧 Troubleshooting

Common Issues

Issue | Possible Cause | Solution
High memory usage | Too many browser instances | Reduce FREECRAWL_MAX_BROWSERS
Slow responses | JavaScript-heavy sites | Increase the timeout or disable JavaScript
Bot detection | Missing anti-detection | Ensure FREECRAWL_ANTI_DETECT=true
Cache misses | TTL too short | Increase FREECRAWL_CACHE_TTL
Import errors | Missing dependencies | Run uvx freecrawl-mcp --test

Debug Mode

With uvx:

export FREECRAWL_LOG_LEVEL=DEBUG
uvx freecrawl-mcp --test

Local development:

export FREECRAWL_LOG_LEVEL=DEBUG
uv run freecrawl-mcp --test

📈 Monitoring & Observability

Health Metrics

  • Browser pool status
  • Memory and CPU usage
  • Cache hit rates
  • Request success rates
  • Response times

Logging

FreeCrawl provides structured logging with configurable levels:

  • ERROR: Critical failures
  • WARNING: Recoverable issues
  • INFO: General operations
  • DEBUG: Detailed troubleshooting

🔧 Development

Running Tests

With uvx:

# Basic functionality test
uvx freecrawl-mcp --test

Local development:

# Basic functionality test
uv run freecrawl-mcp --test

Code Structure

  • Core server: FreeCrawlServer class
  • Browser management: BrowserPool for resource pooling
  • Content extraction: ContentExtractor with multiple strategies
  • Caching: CacheManager with SQLite backend
  • Rate limiting: RateLimiter with token bucket algorithm

📄 License

This project is licensed under the MIT License - see the technical specification for details.

🤝 Contributing

  1. Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp
  2. Create a feature branch
  3. Set up local development: uv sync
  4. Run tests: uv run freecrawl-mcp --test
  5. Submit a pull request

📚 Technical Specification

For detailed technical information, see ai_docs/FREECRAWL_TECHNICAL_SPEC.md.


FreeCrawl MCP Server - Self-hosted web scraping for the modern web 🚀