MCP Web Scraper for Claude Desktop
A Model Context Protocol (MCP) server that enables Claude Desktop to perform advanced web scraping and crawling operations. Extract structured data, analyze website architectures, and discover content relationships - all through natural conversation with Claude.
Features
- Static & Dynamic Scraping: Handle both regular HTML and JavaScript-rendered pages
- Website Crawling: Discover and map entire website structures
- Data Extraction: Extract specific elements using CSS selectors (see the sketch after this list)
- Batch Operations: Process multiple URLs efficiently
- Link Analysis: Understand how pages connect and reference each other
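For a sense of what CSS-selector extraction involves under the hood, here is a minimal sketch using httpx and BeautifulSoup; the libraries and the selector are illustrative assumptions, and the server's actual extract_data implementation may differ:
# Illustrative only: extracting elements with a CSS selector
import httpx
from bs4 import BeautifulSoup

html = httpx.get("https://books.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")

# Select every book title on the page via a CSS selector
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
print(titles[:3])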
Prerequisites
- Python 3.10 or higher
- WSL2 with Ubuntu (for Windows users)
- Claude Desktop application
- uv package manager
Installation
1. Clone the Repository
git clone https://github.com/samirsaci/mcp-webscraper.git
cd mcp-webscraper
2. Install uv Package Manager
If you don't have uv installed:
curl -LsSf https://astral.sh/uv/install.sh | sh
3. Initialize the Project
# Initialize the virtual environment
uv init .
4. Install Dependencies
uv add "mcp[cli]"
source .venv/bin/activate
uv pip install -r requirements.txt
Do not forget to install the Playwright Chromium browser, which is required to scrape dynamic content:
uv run playwright install chromium
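If you want to confirm that Chromium is wired up correctly, a quick manual check like the one below (not part of the repository) renders a JavaScript-capable page and prints the HTML length:
# Quick manual check that Playwright's Chromium works (not part of the repo)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")
    html = page.content()  # HTML after JavaScript execution
    browser.close()
print(f"Rendered HTML length: {len(html)}")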
5. Test the Installation
Run the test script to verify everything works, using a website that loves to be scraped (https://books.toscrape.com/):
uv run python test_local.py
Expected Output:
Static Scraping Success: True
HTML length: 51294
---------
Dynamic Scraping Success: True
HTML length: 51004
---------
Testing Crawler...
Crawler Success: True
Pages crawled: 5
Pages discovered: 437
Failed URLs: 0
First 3 pages discovered:
1. All products | Books to Scrape - Sandbox
URL: https://books.toscrape.com/
Links found: 73
Depth: 0
2. All products | Books to Scrape - Sandbox
URL: https://books.toscrape.com/index.html
Links found: 73
Depth: 1
3. Books | Books to Scrape - Sandbox
URL: https://books.toscrape.com/catalogue/category/books_1/index.html
Links found: 73
Depth: 1
Statistics:
Total unique links: 104
Max depth reached: 1
Avg load time: 0.21s
Claude Desktop Configuration
For Windows Users with WSL
1. Locate your Claude Desktop configuration file:
File -> Settings -> Edit Config
2. Add the WebScrapingServer configuration:
{
"mcpServers": {
"WebScrapingServer": {
"command": "wsl",
"args": [
"-d",
"Ubuntu",
"bash",
"-lc",
"cd ~/path/to/mcp-webscraper && uv run --with mcp[cli] mcp run scrapping.py"
]
}
}
}
Important: Replace ~/path/to/mcp-webscraper with the actual path to your project folder in WSL. To find your WSL path, run this from the project directory:
pwd
3. Restart Claude Desktop
After updating the configuration:
- Completely quit Claude Desktop (not just close the window)
- Start Claude Desktop again
- Look for the MCP icon in the text input area
- Click it to verify "WebScrapingServer" appears
Usage Examples
Once configured, you can ask Claude to:
Basic Scraping
"Scrape the homepage of example.com and tell me what you find"
Advanced SEO analysis
"Please help me crawl my personal blog https://yourblog.com with a limit of 150 pages.
I would like to understand how the articles refer to each other.
Can you help me perform this type of analysis?"
Project Structure
mcp-webscraper/
├── models/
│   └── scraping_models.py   # Pydantic models for data validation
├── utils/
│   └── web_scraper.py       # Core WebScraper class
├── scrapping.py             # MCP server implementation
├── test_local.py            # Local testing script
├── requirements.txt         # Python dependencies
├── README.md                # This file
└── scraping_server.log      # Server logs (created at runtime)
Available MCP Tools
The server exposes these tools to Claude:
- scrape_url: Get raw HTML from any webpage
- extract_data: Extract multiple elements using CSS selectors
- extract_first: Get a single element from a page
- batch_scrape: Process multiple URLs
- crawl_website: Discover and map website structure
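For reference, MCP tools like these are typically registered with the FastMCP helper shipped in the mcp package. The following is a minimal sketch under that assumption, not the actual contents of scrapping.py:
# Minimal sketch of exposing one MCP tool via FastMCP; the real scrapping.py
# implements more tools, Pydantic models, and error handling.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("WebScrapingServer")

@mcp.tool()
def scrape_url(url: str) -> str:
    """Return the raw HTML of a webpage."""
    response = httpx.get(url, follow_redirects=True, timeout=30.0)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    mcp.run()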
Troubleshooting
Server not appearing in Claude
If the server does not appear in Claude, first try restarting Claude Desktop by terminating its process.
If that does not work:
- Check the log file:
cat scraping_server.log
- Verify the path in config matches your WSL path:
pwd
The output should match what you have in your config file.
- Test the server directly:
uv run python scrapping.py
Playwright issues
If JavaScript scraping fails, try reinstalling the browser:
uv run playwright install chromium
WSL-specific issues
Ensure WSL2 is properly installed by running this command in Windows PowerShell opened as Administrator:
wsl --status
License
MIT License - feel free to use this in your own projects!
About me
Senior Supply Chain and Data Science consultant with international experience in Logistics and Transportation operations. For consulting or advice on analytics and sustainable supply chain transformation, feel free to contact me via Logigreen Consulting or LinkedIn.