MCP-Scrape: SEAL Team Six-Grade Web Scraping MCP Server

🎯 Mission Statement

MCP-Scrape is a military-grade web scraping MCP server that combines the most powerful scraping capabilities from multiple battle-tested tools. Built with SEAL Team Six principles: Precision, Reliability, Adaptability, and No Failure.

🚀 Core Capabilities

Operation Modes

  • Stealth Mode: Browser automation with anti-detection measures
  • Rapid Strike: High-speed concurrent crawling with smart rate limiting
  • Deep Recon: LLM-powered intelligent content extraction
  • Search & Destroy: Advanced search with Google operators
  • Persistent Intel: Vector storage for long-term memory

Weapon Systems

  1. Playwright Engine: Full browser automation with JavaScript rendering
  2. Firecrawl Integration: Cloud-based scraping with advanced features
  3. Readability Extraction: Clean content extraction from any webpage
  4. LLM Intelligence: Natural language instructions for data extraction
  5. Proxy Arsenal: Rotating proxies and user agents for evasion (see the sketch after this list)
  6. CAPTCHA Breaker: Multiple strategies for anti-bot bypass
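
To illustrate item 5, proxy and user-agent rotation can be as simple as cycling a pool on each request. The following is a minimal Python sketch; the pool contents and the use of the requests library are assumptions for illustration, not the server's actual implementation:

import itertools
import random

import requests

# Hypothetical pools; in practice these would come from your proxy service
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

proxy_cycle = itertools.cycle(PROXIES)  # round-robin over the proxy pool

def fetch(url: str) -> requests.Response:
    # Each request goes out with the next proxy and a random user agent
    proxy = next(proxy_cycle)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=30)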

šŸ›”ļø Features

Core Scraping Tools

  • scrape_url: Single URL extraction with multiple fallback strategies
  • crawl_site: Full site crawling with intelligent link following
  • search_web: Advanced Google search with structured queries
  • extract_data: LLM-powered data extraction with schemas
  • screenshot: Visual capture of any webpage
  • interact: Browser automation for dynamic content

Advanced Capabilities

  • Multi-Strategy Approach: Automatically tries different methods until one succeeds (see the fallback sketch after this list)
  • Anti-Detection Suite: User agent rotation, proxy support, SSL bypass
  • Rate Limiting: Intelligent throttling to avoid blocks
  • Error Recovery: Exponential backoff and retry logic
  • Content Cleaning: Readability + custom extractors for clean data
  • Memory Integration: Optional vector storage for RAG applications
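
The multi-strategy and error-recovery behavior can be pictured as a fallback chain with exponential backoff. Here is a minimal Python sketch; the strategy function names are hypothetical placeholders, not the server's real API:

import time

def scrape_with_fallback(url, strategies, max_retries=3):
    # Try each strategy in order; retry transient failures with
    # exponential backoff (1s, 2s, 4s, ...) before falling through
    # to the next strategy.
    for strategy in strategies:
        for attempt in range(max_retries):
            try:
                return strategy(url)
            except Exception:
                time.sleep(2 ** attempt)
    raise RuntimeError(f"All strategies failed for {url}")

# Strategies would be ordered cheapest-first, so expensive engines
# only run when needed, e.g.:
# scrape_with_fallback(url, [fetch_http, fetch_playwright, fetch_firecrawl])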

šŸ“‹ Prerequisites

  • Python 3.12+
  • Node.js 18+ (for JavaScript tools)
  • Docker (recommended for deployment)
  • API Keys (optional):
    • Firecrawl API key for cloud scraping
    • Serper API key for search functionality
    • OpenAI/Anthropic for LLM extraction
    • Proxy service credentials

šŸ”§ Installation

Quick Start with Docker (Recommended)

# Clone the repository
git clone https://github.com/yourusername/mcp-scrape.git
cd mcp-scrape

# Build the Docker image
docker build -t mcp/scrape .

# Run with environment variables
docker run --env-file .env -p 8080:8080 mcp/scrape

Manual Installation

# Clone the repository
git clone https://github.com/yourusername/mcp-scrape.git
cd mcp-scrape

# Install Python dependencies
pip install -e .

# Install Node.js dependencies for JavaScript tools
cd js-tools && npm install && cd ..

# Install Playwright browsers
playwright install chromium

# Copy and configure environment
cp .env.example .env
# Edit .env with your configuration

āš™ļø Configuration

Create a .env file with the following variables:

# Transport Configuration
TRANSPORT=sse              # or stdio
HOST=0.0.0.0
PORT=8080

# Scraping Engines
ENABLE_PLAYWRIGHT=true
ENABLE_FIRECRAWL=false    # Requires API key
ENABLE_SERPER=false       # Requires API key

# API Keys (optional)
FIRECRAWL_API_KEY=        # For Firecrawl cloud scraping
SERPER_API_KEY=           # For Google search
OPENAI_API_KEY=           # For LLM extraction

# Proxy Configuration (optional)
PROXY_URL=                # http://user:pass@proxy:port
PROXY_ROTATION=true
USER_AGENT_ROTATION=true

# Performance
MAX_CONCURRENT_REQUESTS=10
RATE_LIMIT_DELAY=1000     # milliseconds
TIMEOUT=30000             # milliseconds

# Storage (optional)
ENABLE_VECTOR_STORE=false
DATABASE_URL=             # PostgreSQL for vector storage
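
On startup the server presumably reads these variables from the environment. A sketch of how that could look with python-dotenv (an assumption about the loading mechanism; check src/main.py for the real logic):

import os

from dotenv import load_dotenv  # assumes the python-dotenv package

load_dotenv()  # read .env from the working directory

TRANSPORT = os.getenv("TRANSPORT", "sse")  # "sse" or "stdio"
PORT = int(os.getenv("PORT", "8080"))
RATE_LIMIT_DELAY = int(os.getenv("RATE_LIMIT_DELAY", "1000")) / 1000  # ms -> s
ENABLE_FIRECRAWL = os.getenv("ENABLE_FIRECRAWL", "false").lower() == "true"

# Fail fast on inconsistent configuration
if ENABLE_FIRECRAWL and not os.getenv("FIRECRAWL_API_KEY"):
    raise RuntimeError("ENABLE_FIRECRAWL=true requires FIRECRAWL_API_KEY")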

🚀 Running the Server

SSE Transport (API Mode)

# Using Python
python src/main.py

# Using Docker
docker run --env-file .env -p 8080:8080 mcp/scrape

Stdio Transport (Direct Integration)

Configure in your MCP client:

{
  "mcpServers": {
    "scrape": {
      "command": "python",
      "args": ["/path/to/mcp-scrape/src/main.py"],
      "env": {
        "TRANSPORT": "stdio",
        "ENABLE_PLAYWRIGHT": "true"
      }
    }
  }
}

šŸ“š Usage Examples

Basic URL Scraping

// Scrape a single URL with automatic strategy selection
await use_mcp_tool("scrape", "scrape_url", {
  url: "https://example.com",
  extract_mode: "auto"  // Tries all methods until one succeeds
});

Advanced Extraction with LLM

// Extract specific data using natural language
await use_mcp_tool("scrape", "extract_data", {
  url: "https://news.site.com",
  instruction: "Extract all article titles, authors, and publication dates",
  schema: {
    title: "string",
    author: "string",
    date: "date"
  }
});
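
Server-side, schema-guided extraction like this typically reduces to prompting an LLM with the page text, the instruction, and the schema. A rough Python sketch using the official OpenAI client (an assumption about the backend; the model name and prompt wording are illustrative):

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_extract(page_text: str, instruction: str, schema: dict) -> dict:
    # Ask the model to return JSON conforming to the caller's schema
    prompt = (
        f"{instruction}\n\n"
        f"Return JSON matching this schema: {json.dumps(schema)}\n\n"
        f"Page content:\n{page_text[:8000]}"  # truncate to fit the context window
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force valid JSON output
    )
    return json.loads(response.choices[0].message.content)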

Full Site Crawling

// Crawl entire website with filters
await use_mcp_tool("scrape", "crawl_site", {
  start_url: "https://docs.example.com",
  max_depth: 3,
  url_pattern: "/docs/*",
  concurrent_limit: 5
});
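
Conceptually, depth-limited crawling with a URL filter is a breadth-first traversal of same-host links. A compact Python sketch using requests and BeautifulSoup, which may not match the server's internals:

import fnmatch
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 3, url_pattern: str = "*") -> dict:
    # Breadth-first crawl: visit each page once, follow links on the
    # same host whose path matches url_pattern, stop at max_depth.
    host = urlparse(start_url).netloc
    seen, pages = {start_url}, {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=30).text
        pages[url] = html
        if depth == max_depth:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # drop fragments
            parsed = urlparse(link)
            if (parsed.netloc == host
                    and fnmatch.fnmatch(parsed.path, url_pattern)
                    and link not in seen):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages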

Browser Interaction

// Interact with dynamic content
await use_mcp_tool("scrape", "interact", {
  url: "https://app.example.com",
  actions: [
    { type: "click", selector: "#login-button" },
    { type: "fill", selector: "#username", value: "user" },
    { type: "wait", time: 2000 },
    { type: "screenshot", filename: "result.png" }
  ]
});
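
On the server side, a declarative action list like this maps naturally onto Playwright calls. A minimal sketch with Playwright's sync Python API, with the action vocabulary inferred from the example above:

from playwright.sync_api import sync_playwright

def run_actions(url: str, actions: list[dict]) -> None:
    # Replay a declarative action list against a live page
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for action in actions:
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["value"])
            elif action["type"] == "wait":
                page.wait_for_timeout(action["time"])  # milliseconds
            elif action["type"] == "screenshot":
                page.screenshot(path=action["filename"])
        browser.close()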

šŸ—ļø Architecture

Multi-Layer Scraping Strategy

┌─────────────────────────────────────────┐
│           MCP Client Request            │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Strategy Selector            │
│     (Chooses best approach for URL)     │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Scraping Engines             │
├─────────────────────────────────────────┤
│ 1. Playwright (JS rendering)            │
│ 2. Firecrawl (Cloud API)                │
│ 3. HTTP Client (Simple HTML)            │
│ 4. Readability (Content extraction)     │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│          Anti-Detection Layer           │
├─────────────────────────────────────────┤
│ • Proxy rotation                        │
│ • User agent spoofing                   │
│ • Rate limiting                         │
│ • CAPTCHA handling                      │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│           Content Processing            │
├─────────────────────────────────────────┤
│ • HTML cleaning                         │
│ • Markdown conversion                   │
│ • LLM extraction                        │
│ • Structured data parsing               │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Optional Storage             │
│          (Vector DB for RAG)            │
└─────────────────────────────────────────┘
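
The concurrency and rate limiting in the Rapid Strike path might be implemented along these lines. This is a sketch assuming asyncio and aiohttp, with constants mirroring MAX_CONCURRENT_REQUESTS and RATE_LIMIT_DELAY from the configuration:

import asyncio

import aiohttp

MAX_CONCURRENT_REQUESTS = 10  # cap on in-flight requests
RATE_LIMIT_DELAY = 1.0        # seconds of politeness delay per request

semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with semaphore:                      # enforce the concurrency cap
        await asyncio.sleep(RATE_LIMIT_DELAY)  # throttle to avoid blocks
        async with session.get(url) as resp:
            return await resp.text()

async def crawl(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# asyncio.run(crawl(["https://example.com", "https://example.org"]))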

šŸŽ–ļø SEAL Team Six Principles

This tool embodies elite military principles:

  1. Mission First: Every scraping request completes successfully
  2. Failure is Not an Option: Multiple fallback strategies ensure success
  3. Adapt and Overcome: Automatic strategy selection based on target
  4. Leave No Trace: Stealth mode with anti-detection measures
  5. Intelligence Driven: LLM-powered smart extraction
  6. Team Coordination: Modular architecture for easy extension

šŸ”’ Security & Ethics

  • Always respect robots.txt and website terms of service
  • Use rate limiting to avoid overwhelming servers
  • Only scrape publicly available information
  • Implement proper authentication when required
  • Store sensitive data securely

🤝 Contributing

We welcome contributions that enhance our scraping capabilities:

  1. Fork the repository
  2. Create a feature branch
  3. Add your enhancement with tests
  4. Submit a pull request

šŸ“„ License

MIT License - See LICENSE file for details

šŸ™ Acknowledgments

This project integrates the best features from:

  • Playwright: browser automation and JavaScript rendering
  • Firecrawl: cloud-based scraping API
  • Mozilla Readability: clean content extraction
  • Serper: Google search integration

🚨 Remember: With great scraping power comes great responsibility. Use wisely.