MCP WebScraper

MCP WebScraper is a local web‑scraping service that exposes simple tools over the Model Context Protocol (MCP). It can scrape static pages (HTTPX) and JavaScript‑heavy pages (Playwright). The primary use is to run it locally as an MCP server so Cursor can call tools like scrape_url, scrape_batch, and validate_selectors.

Quick start (run as an MCP server in Cursor)

Requirements: Python 3.9+, internet access to install Playwright browsers

  1. Clone and create a virtual environment
git clone https://github.com/Huxley-Brown/mcp-webscraper.git
cd mcp-webscraper
python -m venv .venv
. .venv/bin/activate   # On Windows: .venv\Scripts\activate
  2. Install the project and browsers
pip install -e .
python -m playwright install chromium
  3. Add a project MCP config (recommended). Create ./.cursor/mcp.json in the project root with your absolute path:
{
  "mcpServers": {
    "webscraper": {
      "command": "/absolute/path/to/webscraper/.venv/bin/python",
      "args": ["-m", "mcp_webscraper.mcp_server"]
    }
  }
}
  4. Restart Cursor

  5. Use in Cursor chat

@webscraper scrape_url url="https://quotes.toscrape.com/"

That’s it. Cursor manages the MCP server process automatically.

Optional: CLI or REST

  • CLI (single URL):
python -m src.mcp_webscraper.cli scrape --url https://quotes.toscrape.com/
  • API server (with MCP HTTP endpoint mounted at /mcp):
uvicorn src.mcp_webscraper.api.main:app --host 0.0.0.0 --port 8000

Troubleshooting (after moving folders)

If you moved the project and the server won’t start:

  • Recreate venv and reinstall:
rm -rf .venv && python -m venv .venv
. .venv/bin/activate
pip install -e .
python -m playwright install chromium
  • Ensure ./.cursor/mcp.json points to the new absolute path of .venv/bin/python.
  • Prefer the project‑level config over a global ~/.cursor/mcp.json to avoid path drift.

What tools it provides

  • scrape_url(url, custom_selectors?, force_dynamic?)
  • scrape_batch(urls[], custom_selectors?, force_dynamic?)
  • validate_selectors(url, selectors{})
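
These tools can also be called programmatically. As a rough illustration, here is a minimal stdio client sketch using the official mcp Python SDK; the interpreter path is a placeholder, and the result parsing assumes the tools return their payload as JSON text content:

import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Placeholder path: point this at the venv created during setup.
    params = StdioServerParameters(
        command="/absolute/path/to/webscraper/.venv/bin/python",
        args=["-m", "mcp_webscraper.mcp_server"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "scrape_url", {"url": "https://quotes.toscrape.com/"}
            )
            # Assumes the tool returns its payload as JSON text.
            print(json.loads(result.content[0].text))

asyncio.run(main())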

For more examples, see docs/MCP_INTEGRATION.md.

MCP WebScraper

A production-ready local web scraping service with dynamic page support, designed as an MCP (Model Context Protocol) server for AI agents. Features comprehensive error handling, intelligent JavaScript detection, and enterprise-grade reliability.

🌟 Features

Core Capabilities

  • Smart Content Detection: Advanced JavaScript detection with sophisticated scoring system
  • Dual Scraping Modes: Efficient static (HTTPX) + dynamic (Playwright) with automatic routing
  • MCP Server Integration: Native MCP protocol support with three specialized tools
  • Non-blocking Processing: Asynchronous job queue with real-time status tracking
  • Local-only: No external dependencies or cloud services required

MCP Tools and Resources

  • scrape_url: Scrape a single URL with custom selectors and metadata extraction
  • scrape_batch: Process multiple URLs efficiently with combined results
  • validate_selectors: Test CSS selectors before scraping to ensure accuracy (example after this list)
  • MCP Resources: Access configuration and job status information
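
Following the same chat syntax shown in the quick start, a selector check might look like this (illustrative):

@webscraper validate_selectors url="https://quotes.toscrape.com/" selectors='{"container": ".quote", "text": ".text"}'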

Enterprise Features

  • Circuit Breakers: Automatic failure detection and recovery for unreliable domains (sketched after this list)
  • Anti-scraping Measures: User agent rotation, robots.txt compliance, rate limiting
  • Comprehensive Error Handling: Retry logic with exponential backoff
  • Resource Management: Configurable concurrency limits and memory controls
  • Performance Monitoring: Detailed statistics and health check endpoints
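
To make the circuit-breaker idea concrete, here is a minimal per-domain sketch. The names are illustrative, not the project's actual implementation (which lives in core/error_handling.py); the defaults mirror the CIRCUIT_BREAKER_* settings from the Configuration section:

import time
from typing import Optional

class DomainCircuitBreaker:
    """Fail fast for a domain after repeated errors; probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 300.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: requests flow normally
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            # Half-open: let one probe through; a failure re-opens immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False  # open: skip the network call entirely

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()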

Developer Experience

  • Rich CLI Interface: Progress bars, colored output, and verbose logging
  • Complete Test Suite: Unit, integration, and real-world validation tests
  • Comprehensive Documentation: MCP integration guides and testing documentation
  • Full Configuration: 25+ environment-configurable parameters

🚀 Performance

  • Processing Speed: 1.4-1.8 seconds per job
  • Success Rate: 100% on test sites
  • Smart Routing: Automatic static/dynamic detection
  • Concurrent Jobs: Up to 5 simultaneous scraping operations

🏗️ Architecture

  • FastAPI REST server with auto-generated OpenAPI documentation
  • MCP Server mounted at /mcp with streamable HTTP transport
  • HTTPX for efficient HTTP requests with HTTP/2 support
  • Playwright for JavaScript-heavy pages (headless Chromium); a routing sketch follows this list
  • BeautifulSoup + lxml for robust HTML parsing
  • Typer for rich command-line interface with progress indicators
  • AsyncIO job queue with background workers
  • Pydantic for configuration management and data validation
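
The static/dynamic split works roughly like this: fetch with HTTPX first, then fall back to Playwright when the HTML looks JavaScript-rendered. A minimal sketch with made-up heuristics (the real detector in core/detector.py uses a scoring system):

import httpx
from playwright.async_api import async_playwright

SPA_HINTS = ('id="root"', 'id="app"', "window.__NUXT__")  # hypothetical markers

async def fetch_html(url: str) -> str:
    async with httpx.AsyncClient(follow_redirects=True, timeout=30) as client:
        resp = await client.get(url)
    html = resp.text
    # Crude heuristic: a tiny body or SPA mount points suggest JS rendering.
    if len(html) < 2048 or any(hint in html for hint in SPA_HINTS):
        return await render_with_playwright(url)
    return html

async def render_with_playwright(url: str) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
    return html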

📦 Installation

Requirements

  • Python 3.9 or higher
  • ~200MB disk space for Playwright browsers

Quick Setup

  1. Clone and install:

    git clone https://github.com/Huxley-Brown/mcp-webscraper.git
    cd mcp-webscraper
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -e .
    
  2. Install Playwright browsers:

    python -m playwright install chromium
    
  3. Quick test:

    make run  # Start the API server with MCP endpoint
    # In another terminal:
    python -m src.mcp_webscraper.cli scrape --url https://quotes.toscrape.com/
    

📖 Usage

MCP Integration (Recommended for AI Agents)

Cursor IDE Setup
  1. Create .cursor/mcp.json in your project root (recommended):

    {
      "mcpServers": {
        "webscraper": {
          "command": "/absolute/path/to/webscraper/.venv/bin/python",
          "args": ["-m", "mcp_webscraper.mcp_server"]
        }
      }
    }
    
    • This uses the installed package entry (mcp_webscraper.mcp_server) from your local venv and requires pip install -e . to have been run in that venv.
    • Alternative (run from source tree): use "args": ["-m", "src.mcp_webscraper.mcp_server"] and ensure the project root is on PYTHONPATH or set "cwd" to the project directory.
  2. Restart Cursor and use in chat:

    @webscraper scrape_url url="https://quotes.toscrape.com/" custom_selectors='{"container": ".quote", "text": ".text", "author": ".author"}'
    
Python MCP Client
import asyncio
import json

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def scrape_with_mcp():
    # Connect to MCP server (requires API server running)
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            
            # Use MCP tools
            result = await session.call_tool("scrape_url", {
                "url": "https://quotes.toscrape.com/",
                "custom_selectors": {
                    "container": ".quote",
                    "text": ".text",
                    "author": ".author"
                }
            })
            
            # Tool results arrive as text content; parse the JSON payload.
            scraped_data = json.loads(result.content[0].text)
            print(f"Scraped {scraped_data['data_count']} items")
            return scraped_data

# Run the MCP client
asyncio.run(scrape_with_mcp())

CLI Interface

Single URL with rich output:

python -m src.mcp_webscraper.cli scrape --url https://quotes.toscrape.com/ --verbose

Batch processing:

python -m src.mcp_webscraper.cli scrape --list-file inputs/urls.json --output-dir ./results

Custom selectors:

python -m src.mcp_webscraper.cli scrape \
  --url https://news.ycombinator.com/ \
  --selectors '{"container": ".athing", "title": ".titleline > a"}'

Input file formats:

JSON:

{
  "urls": [
    "https://quotes.toscrape.com/",
    "https://httpbin.org/html"
  ]
}

CSV:

url
https://quotes.toscrape.com/
https://httpbin.org/html

REST API

Start the server:

make run
# Or manually:
uvicorn src.mcp_webscraper.api.main:app --host 0.0.0.0 --port 8000

Submit a job with custom selectors:

curl -X POST "http://localhost:8000/scrape" \
     -H "Content-Type: application/json" \
     -d '{
           "input_type": "url",
           "target": "https://news.ycombinator.com/",
           "custom_selectors": {
             "container": ".athing",
             "title": ".titleline > a",
             "score": ".score"
           }
         }'

Monitor job progress:

JOB_ID="your-job-id"
curl "http://localhost:8000/status/$JOB_ID"
curl "http://localhost:8000/results/$JOB_ID"

API Endpoints

Endpoint            Method   Description
/                   GET      Service information and status
/mcp/*              *        MCP server endpoints (tools and resources)
/scrape             POST     Submit scraping job
/status/{job_id}    GET      Check job status
/results/{job_id}   GET      Download results
/jobs               GET      List recent jobs
/jobs/{job_id}      DELETE   Cancel running job
/config             GET      View current configuration
/stats              GET      Basic system statistics
/stats/detailed     GET      Detailed performance metrics
/health             GET      Health check endpoint
/docs               GET      Interactive API documentation

📊 Output Format

Results are saved as structured JSON files:

{
  "job_id": "9b5328c8",
  "source_url": "https://quotes.toscrape.com/",
  "scrape_timestamp": "2025-01-20T11:14:23Z",
  "status": "completed",
  "extraction_method": "static",
  "data": [
    {
      "text": "The world as we have created it is a process of our thinking...",
      "title": null,
      "url": "https://quotes.toscrape.com/",
      "metadata": {
        "author": "Albert Einstein",
        "tags": ["change", "deep-thoughts", "thinking"]
      }
    }
  ],
  "metadata": {
    "processing_time_seconds": 1.82,
    "data_items_count": 10,
    "html_size_bytes": 15420
  }
}
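
Because results land on disk as JSON (OUTPUT_DIR defaults to ./scrapes_out), downstream code can consume them directly. A small sketch; the filename is hypothetical since actual names depend on the job:

import json
from pathlib import Path

result = json.loads(Path("scrapes_out/9b5328c8.json").read_text())
print(result["status"], "-", result["metadata"]["data_items_count"], "items via", result["extraction_method"])
for item in result["data"]:
    author = item["metadata"].get("author", "unknown")
    print(f"{author}: {(item['text'] or '')[:60]}")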

⚙️ Configuration

Environment Variables

Create a .env file or set environment variables:

# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=false
LOG_LEVEL=INFO

# Resource Limits
MAX_CONCURRENT_JOBS=5
MAX_PLAYWRIGHT_INSTANCES=3
MAX_QUEUE_SIZE=100
OUTPUT_DIR=./scrapes_out

# Scraping Behavior
REQUEST_DELAY=1.0
TIMEOUT=30
MAX_RETRIES=3
RESPECT_ROBOTS_TXT=true
USER_AGENT_ROTATION=true

# Anti-Scraping Measures
MAX_CONCURRENT_PER_DOMAIN=2
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=300

# CORS and Security
CORS_ORIGINS=*
API_KEY_ENABLED=false

# Monitoring
ENABLE_STATS=true
LOG_FILE=logs/webscraper.log
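
Since configuration is Pydantic-based (see Architecture), these variables map onto a settings class along these lines. A sketch using pydantic-settings; field names mirror the env vars above, but the project's actual Settings class may differ:

from pydantic_settings import BaseSettings, SettingsConfigDict

class ScraperSettings(BaseSettings):
    # Defaults match the example values above; env vars override them
    # (matching is case-insensitive, so HOST populates host).
    host: str = "0.0.0.0"
    port: int = 8000
    max_concurrent_jobs: int = 5
    max_playwright_instances: int = 3
    request_delay: float = 1.0
    respect_robots_txt: bool = True

    model_config = SettingsConfigDict(env_file=".env")

settings = ScraperSettings()  # reads the process environment, then .env
print(settings.max_concurrent_jobs)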

Advanced Configuration

See the configuration documentation in docs/ for all 25+ configurable parameters.

🧪 Testing

Run the Test Suite

# Install test dependencies
pip install pytest pytest-asyncio pytest-cov

# Run all tests
pytest

# Run with coverage
pytest --cov=src/mcp_webscraper --cov-report=html

# Run specific test categories
pytest -m "not integration"      # Skip integration tests
pytest tests/test_core.py -v     # Run specific test file

Test Categories

  • Unit Tests: Core functionality, configuration, error handling
  • Integration Tests: End-to-end workflows, real website testing
  • Performance Tests: Load testing, memory usage validation

See the testing documentation in docs/ for comprehensive testing guides.

📚 Documentation

All guides live in the docs/ directory:

  • docs/MCP_INTEGRATION.md: Complete examples for AI agent integration
  • Comprehensive testing documentation
  • Technical architecture details
  • Original project analysis and design

🏗️ Project Structure

webscraper/
├── src/mcp_webscraper/         # Main package
│   ├── api/                    # FastAPI REST endpoints
│   ├── core/                   # Core scraping logic
│   │   ├── scraper.py         # Main scraper class
│   │   ├── detector.py        # JavaScript detection
│   │   ├── anti_scraping.py   # Anti-scraping measures
│   │   └── error_handling.py  # Error handling & circuit breakers
│   ├── jobs/                   # Job queue and workers
│   ├── models/                 # Pydantic data models
│   ├── config/                 # Configuration management
│   ├── mcp_server.py          # MCP protocol server
│   └── cli.py                  # Command-line interface
├── tests/                      # Comprehensive test suite
├── docs/                       # Documentation
├── inputs/                     # Sample input files
└── scrapes_out/               # Default output directory

🔧 Development

Development Setup

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run development server with auto-reload
make dev

Troubleshooting (after moving directories)

  • Recreate the virtual environment and reinstall deps:
    rm -rf .venv
    python -m venv .venv
    . .venv/bin/activate
    pip install -e .
    python -m playwright install chromium
    
  • Update Cursor config paths if they referenced the old location. Prefer the project-level ./.cursor/mcp.json shown above.

Quality Assurance

# Run full test suite
make test

# Code formatting
make format

# Type checking
make typecheck

# Linting
make lint

🚀 MCP Integration Examples

Advanced MCP Client Usage

import asyncio
import json

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def advanced_mcp_scraping():
    """Advanced MCP integration with error handling and batch processing."""
    
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            
            # Validate selectors first
            validation = await session.call_tool("validate_selectors", {
                "url": "https://news.ycombinator.com/",
                "selectors": {
                    "container": ".athing",
                    "title": ".titleline > a",
                    "score": ".score"
                }
            })
            
            valid_selectors = json.loads(validation.content[0].text)["valid_selectors"]
            print(f"Valid selectors: {valid_selectors}")
            
            # Batch scrape multiple URLs
            urls = [
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/"
            ]
            
            batch_result = await session.call_tool("scrape_batch", {
                "urls": urls,
                "custom_selectors": {
                    "container": ".quote",
                    "text": ".text",
                    "author": ".author"
                }
            })
            
            batch_data = json.loads(batch_result.content[0].text)
            print(f"Batch scraped {batch_data['total_items']} items from {batch_data['total_urls']} URLs")
            
            return batch_data

# Run advanced example
asyncio.run(advanced_mcp_scraping())

AI Content Aggregator Example

async def ai_content_aggregator():
    """Example AI agent that aggregates content from multiple sources using MCP."""
    
    sources = [
        {
            "name": "Hacker News",
            "url": "https://news.ycombinator.com/",
            "selectors": {"container": ".athing", "title": ".titleline > a"}
        },
        {
            "name": "Quotes",
            "url": "https://quotes.toscrape.com/",
            "selectors": {"container": ".quote", "text": ".text", "author": ".author"}
        }
    ]
    
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            
            aggregated_content = []
            
            for source in sources:
                result = await session.call_tool("scrape_url", {
                    "url": source["url"],
                    "custom_selectors": source["selectors"]
                })
                
                scrape_data = json.loads(result.content[0].text)
                for item in scrape_data['data']:
                    aggregated_content.append({
                        "source": source["name"],
                        "data": item,
                        "extraction_method": scrape_data["extraction_method"]
                    })
            
            return aggregated_content

# Run aggregator
content = asyncio.run(ai_content_aggregator())
print(f"Aggregated {len(content)} items from multiple sources")

See docs/MCP_INTEGRATION.md for complete examples and Cursor setup.

📈 Performance Benchmarks

Based on real-world testing:

Site Type        Processing Time   Success Rate   Method
Static HTML      1.4-1.8s          100%           HTTPX
JavaScript SPA   2.5-4.2s          98%            Playwright
News Sites       1.2-2.1s          100%           Auto-detect
E-commerce       2.8-5.1s          95%            Mixed

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run the test suite (make test)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

📄 License

MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built for the Model Context Protocol (MCP) ecosystem
  • Designed for ethical web scraping with respect for robots.txt and rate limits
  • Optimized for AI agent integration and automation workflows

⚠️ Ethical Usage: This tool is designed for ethical web scraping. Always respect robots.txt, rate limits, and website terms of service. Use responsibly and in compliance with applicable laws and regulations.