# MCP-Scrape: SEAL Team Six-Grade Web Scraping MCP Server
## Mission Statement
MCP-Scrape is a military-grade web scraping MCP server that combines the most powerful scraping capabilities from multiple battle-tested tools. Built with SEAL Team Six principles: Precision, Reliability, Adaptability, and No Failure.
## Core Capabilities
### Operation Modes
- Stealth Mode: Browser automation with anti-detection measures
- Rapid Strike: High-speed concurrent crawling with smart rate limiting
- Deep Recon: LLM-powered intelligent content extraction
- Search & Destroy: Advanced search with Google operators
- Persistent Intel: Vector storage for long-term memory
### Weapon Systems
- Playwright Engine: Full browser automation with JavaScript rendering
- Firecrawl Integration: Cloud-based scraping with advanced features
- Readability Extraction: Clean content extraction from any webpage (see the sketch after this list)
- LLM Intelligence: Natural language instructions for data extraction
- Proxy Arsenal: Rotating proxies and user agents for evasion
- CAPTCHA Breaker: Multiple strategies for anti-bot bypass
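As a rough illustration of the Readability extraction step, here is a minimal sketch built on the `readability-lxml` and `requests` packages; the package choice is an assumption, not necessarily how mcp-scrape wires this internally.

```python
# Minimal sketch of clean-content extraction, assuming the
# readability-lxml and requests packages (illustrative only).
import requests
from readability import Document

def extract_clean_content(url: str) -> dict:
    # Fetch the raw HTML.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Readability isolates the main article and its title.
    doc = Document(response.text)
    return {
        "title": doc.title(),
        "content_html": doc.summary(),  # cleaned HTML of the main content
    }

if __name__ == "__main__":
    print(extract_clean_content("https://example.com")["title"])
```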
## Features
### Core Scraping Tools
- `scrape_url`: Single URL extraction with multiple fallback strategies
- `crawl_site`: Full site crawling with intelligent link following
- `search_web`: Advanced Google search with structured queries
- `extract_data`: LLM-powered data extraction with schemas
- `screenshot`: Visual capture of any webpage
- `interact`: Browser automation for dynamic content
### Advanced Capabilities
- Multi-Strategy Approach: Automatically tries different methods until one succeeds (see the sketch after this list)
- Anti-Detection Suite: User agent rotation, proxy support, SSL bypass
- Rate Limiting: Intelligent throttling to avoid blocks
- Error Recovery: Exponential backoff and retry logic
- Content Cleaning: Readability + custom extractors for clean data
- Memory Integration: Optional vector storage for RAG applications
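Conceptually, the multi-strategy and error-recovery items above boil down to a loop over engines with exponential backoff. The sketch below is illustrative; the strategy list and function names are assumptions, not mcp-scrape's actual internals.

```python
# Illustrative sketch of "try methods until one succeeds" with
# exponential backoff; not mcp-scrape's real implementation.
import time
import requests

def fetch_with_http(url: str) -> str:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

# Hypothetical further strategies (Playwright, Firecrawl) would slot in here.
STRATEGIES = [fetch_with_http]

def scrape_with_fallback(url: str, retries: int = 3) -> str:
    last_error: Exception | None = None
    for strategy in STRATEGIES:
        delay = 1.0
        for _ in range(retries):
            try:
                return strategy(url)
            except Exception as exc:
                last_error = exc
                time.sleep(delay)  # back off before retrying
                delay *= 2         # exponential growth: 1s, 2s, 4s, ...
    raise RuntimeError(f"All strategies failed for {url}") from last_error
```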
## Prerequisites
- Python 3.12+
- Node.js 18+ (for JavaScript tools)
- Docker (recommended for deployment)
- API Keys (optional):
  - Firecrawl API key for cloud scraping
  - Serper API key for search functionality
  - OpenAI/Anthropic for LLM extraction
  - Proxy service credentials
## Installation
### Quick Start with Docker (Recommended)
```bash
# Clone the repository
git clone https://github.com/yourusername/mcp-scrape.git
cd mcp-scrape

# Build the Docker image
docker build -t mcp/scrape .

# Run with environment variables
docker run --env-file .env -p 8080:8080 mcp/scrape
```
### Manual Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/mcp-scrape.git
cd mcp-scrape

# Install Python dependencies
pip install -e .

# Install Node.js dependencies for JavaScript tools
cd js-tools && npm install && cd ..

# Install Playwright browsers
playwright install chromium

# Copy and configure environment
cp .env.example .env
# Edit .env with your configuration
```
## Configuration

Create a `.env` file with the following variables:
```bash
# Transport Configuration
TRANSPORT=sse              # or stdio
HOST=0.0.0.0
PORT=8080

# Scraping Engines
ENABLE_PLAYWRIGHT=true
ENABLE_FIRECRAWL=false     # Requires API key
ENABLE_SERPER=false        # Requires API key

# API Keys (optional)
FIRECRAWL_API_KEY=         # For Firecrawl cloud scraping
SERPER_API_KEY=            # For Google search
OPENAI_API_KEY=            # For LLM extraction

# Proxy Configuration (optional)
PROXY_URL=                 # http://user:pass@proxy:port
PROXY_ROTATION=true
USER_AGENT_ROTATION=true

# Performance
MAX_CONCURRENT_REQUESTS=10
RATE_LIMIT_DELAY=1000      # milliseconds
TIMEOUT=30000              # milliseconds

# Storage (optional)
ENABLE_VECTOR_STORE=false
DATABASE_URL=              # PostgreSQL for vector storage
```
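For orientation, the sketch below shows roughly how these variables could be read at startup, assuming the `python-dotenv` package; the server's actual loader may differ.

```python
# Minimal sketch of reading the .env configuration, assuming the
# python-dotenv package; variable names mirror the block above.
import os
from dotenv import load_dotenv

load_dotenv()  # populates os.environ from .env

TRANSPORT = os.getenv("TRANSPORT", "sse")
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8080"))
ENABLE_PLAYWRIGHT = os.getenv("ENABLE_PLAYWRIGHT", "true").lower() == "true"
MAX_CONCURRENT_REQUESTS = int(os.getenv("MAX_CONCURRENT_REQUESTS", "10"))
RATE_LIMIT_DELAY_MS = int(os.getenv("RATE_LIMIT_DELAY", "1000"))
TIMEOUT_MS = int(os.getenv("TIMEOUT", "30000"))
```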
## Running the Server
### SSE Transport (API Mode)
```bash
# Using Python
python src/main.py

# Using Docker
docker run --env-file .env -p 8080:8080 mcp/scrape
```
### Stdio Transport (Direct Integration)
Configure in your MCP client:
```json
{
  "mcpServers": {
    "scrape": {
      "command": "python",
      "args": ["/path/to/mcp-scrape/src/main.py"],
      "env": {
        "TRANSPORT": "stdio",
        "ENABLE_PLAYWRIGHT": "true"
      }
    }
  }
}
```
## Usage Examples
### Basic URL Scraping
```javascript
// Scrape a single URL with automatic strategy selection
await use_mcp_tool("scrape", "scrape_url", {
  url: "https://example.com",
  extract_mode: "auto" // Tries all methods until success
});
```
### Advanced Extraction with LLM
```javascript
// Extract specific data using natural language
await use_mcp_tool("scrape", "extract_data", {
  url: "https://news.site.com",
  instruction: "Extract all article titles, authors, and publication dates",
  schema: {
    title: "string",
    author: "string",
    date: "date"
  }
});
```
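Server-side, schema-guided extraction along these lines typically reduces to a single LLM call. The sketch below assumes the official `openai` Python SDK; the prompt wording, model choice, and function name are illustrative, not mcp-scrape's documented internals.

```python
# Sketch of schema-guided extraction via an LLM, assuming the openai
# package; prompt wording and model choice are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_extract(page_text: str, instruction: str, schema: dict) -> dict:
    prompt = (
        f"{instruction}\n\n"
        f"Return JSON matching this schema: {json.dumps(schema)}\n\n"
        f"Page content:\n{page_text[:8000]}"  # truncate to fit the context
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force valid JSON output
    )
    return json.loads(response.choices[0].message.content)
```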
### Full Site Crawling
```javascript
// Crawl entire website with filters
await use_mcp_tool("scrape", "crawl_site", {
  start_url: "https://docs.example.com",
  max_depth: 3,
  url_pattern: "/docs/*",
  concurrent_limit: 5
});
```
### Browser Interaction
```javascript
// Interact with dynamic content
await use_mcp_tool("scrape", "interact", {
  url: "https://app.example.com",
  actions: [
    { type: "click", selector: "#login-button" },
    { type: "fill", selector: "#username", value: "user" },
    { type: "wait", time: 2000 },
    { type: "screenshot", filename: "result.png" }
  ]
});
```
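An action list like this maps almost one-to-one onto Playwright calls. The sketch below uses Playwright's Python sync API; the dispatch logic is illustrative, not mcp-scrape's actual handler.

```python
# Sketch of replaying an action list with Playwright's sync API;
# the action format mirrors the example above.
from playwright.sync_api import sync_playwright

def run_actions(url: str, actions: list[dict]) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for action in actions:
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["value"])
            elif action["type"] == "wait":
                page.wait_for_timeout(action["time"])  # milliseconds
            elif action["type"] == "screenshot":
                page.screenshot(path=action["filename"])
        browser.close()
```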
## Architecture
### Multi-Layer Scraping Strategy
```
┌─────────────────────────────────────────┐
│           MCP Client Request            │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Strategy Selector            │
│     (Chooses best approach for URL)     │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Scraping Engines             │
├─────────────────────────────────────────┤
│  1. Playwright (JS rendering)           │
│  2. Firecrawl (Cloud API)               │
│  3. HTTP Client (Simple HTML)           │
│  4. Readability (Content extraction)    │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│          Anti-Detection Layer           │
├─────────────────────────────────────────┤
│  • Proxy rotation                       │
│  • User agent spoofing                  │
│  • Rate limiting                        │
│  • CAPTCHA handling                     │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│           Content Processing            │
├─────────────────────────────────────────┤
│  • HTML cleaning                        │
│  • Markdown conversion                  │
│  • LLM extraction                       │
│  • Structured data parsing              │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Optional Storage             │
│           (Vector DB for RAG)           │
└─────────────────────────────────────────┘
```
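As a concrete example of the anti-detection layer, user-agent rotation with an optional proxy might look like the sketch below, built on `requests`; the agent strings and proxy handling are placeholders, not the server's real implementation.

```python
# Sketch of user-agent rotation with an optional proxy, using requests;
# the agent strings and proxy URL are placeholders.
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def fetch(url: str, proxy_url: str | None = None) -> requests.Response:
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    return requests.get(url, headers=headers, proxies=proxies, timeout=30)
```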
## SEAL Team Six Principles
This tool embodies elite military principles:
- Mission First: Every scraping request completes successfully
- Failure is Not an Option: Multiple fallback strategies ensure success
- Adapt and Overcome: Automatic strategy selection based on target
- Leave No Trace: Stealth mode with anti-detection measures
- Intelligence Driven: LLM-powered smart extraction
- Team Coordination: Modular architecture for easy extension
## Security & Ethics
- Always respect robots.txt and website terms of service (see the sketch after this list)
- Use rate limiting to avoid overwhelming servers
- Only scrape publicly available information
- Implement proper authentication when required
- Store sensitive data securely
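The robots.txt check in the first point needs nothing beyond Python's standard library; a minimal sketch (referenced in the list above):

```python
# Check robots.txt before scraping, using only the standard library.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "*") -> bool:
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    parser.read()  # fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(is_allowed("https://example.com/some/page"))
```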
## Contributing
We welcome contributions that enhance our scraping capabilities:
1. Fork the repository
2. Create a feature branch
3. Add your enhancement with tests
4. Submit a pull request
## License
MIT License - see the LICENSE file for details.
## Acknowledgments
This project integrates the best features from battle-tested tools including Playwright, Firecrawl, Readability, and Serper.
**Remember**: With great scraping power comes great responsibility. Use wisely.