IsaacIndex/webdocs-mcp-server
Web Scraper MCP Server
A web scraping server that implements the Model Context Protocol (MCP) using FastAPI. This server provides tools for extracting content and links from web pages in a structured way.
Features
- Web content extraction with intelligent content cleaning
- Link extraction with full URL resolution
- Language detection
- Headless browser automation with Playwright by default or Selenium when cookies are required
- Open URLs in your existing browser session
- Multi-step website actions with Playwright
- FastAPI-based REST API
- MCP protocol implementation
Prerequisites
- Python 3.11+
- Chrome browser installed
- If Chrome is not in your PATH, set the `CHROME_BINARY` environment variable to the full path of the Chrome executable
- The scraper can run in Playwright or Selenium mode; Playwright is the default unless cookie-based sessions are needed
- uv package manager
Setup
- Create a virtual environment and install dependencies using uv:
uv venv
uv pip install -r requirements.txt
- Create a `.env` file (optional):
PORT=8000
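As a minimal sketch, the optional `PORT` value might be consumed like this (how the server actually loads the `.env` file is an assumption):

```python
import os

def get_port(default: int = 8000) -> int:
    """Return the server port from the PORT environment variable, falling back to 8000."""
    return int(os.environ.get("PORT", default))
```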
Usage modes
MCP mode
Start the server:
python mcp_server.py [--log-level info|debug|warning|error|critical]
The server will start on http://localhost:8000 by default.
Agent mode
The old `agents.py` script is deprecated; use `agents_stream_tools.py` instead. Legacy code is kept in `agents_legacy.py` for reference.
Run the interactive agent:
python agents_stream_tools.py "your question here"
API Endpoints
MCP Endpoints
Scrape Website
- URL: `/mcp`
- Method: POST
- Command: `scrape_website`
- Parameters:
{ "url": "https://example.com", "query": "specific topic" }
- Response: Filtered content relevant to the query
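The query-based filtering could conceptually work like the sketch below, which keeps only paragraphs mentioning a query term. This is an illustration, not the server's actual relevance logic:

```python
def filter_content(paragraphs: list[str], query: str) -> list[str]:
    """Keep paragraphs that mention any word from the query (case-insensitive)."""
    terms = {t.lower() for t in query.split()}
    return [
        p for p in paragraphs
        if terms & {w.lower().strip(".,") for w in p.split()}
    ]
```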
Extract Links
- URL: `/mcp`
- Method: POST
- Command: `extract_links`
- Parameters:
{ "url": "https://example.com" }
The server fetches the URL and returns all links found on the page.
- Relative paths are converted to absolute URLs based on the provided page.
- Response: List of all links found on the page
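The relative-to-absolute conversion described above can be sketched with the standard library; the server itself renders pages in a browser engine, so treat this as illustrative only:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin turns relative paths into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```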
Download PDFs
- URL: `/mcp`
- Method: POST
- Command: `download_pdfs`
- Parameters:
{ "links": ["https://example.com/sample.pdf"] }
- Response: Paths to the downloaded PDF files
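Deriving local file paths for the downloaded PDFs might look like the sketch below; the server's real download logic and target directory are unknown, so the `downloads` folder is an assumption:

```python
from pathlib import Path
from urllib.parse import urlparse

def pdf_targets(links: list[str], out_dir: str = "downloads") -> list[Path]:
    """Map PDF URLs to local paths, skipping links that are not PDFs."""
    targets = []
    for link in links:
        name = Path(urlparse(link).path).name
        if name.lower().endswith(".pdf"):
            targets.append(Path(out_dir) / name)
    return targets
```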
Ping
- URL: `/mcp`
- Method: POST
- Command: `ping`
- Response: Server status and version information
Open Browser
- URL: `/mcp`
- Method: POST
- Command: `open_browser`
- Parameters:
{ "url": "https://example.com" }
- Response: Confirmation that the URL was opened using your browser session
React Browser Task
- URL: `/mcp`
- Method: POST
- Command: `react_browser_task`
- Parameters:
{ "url": "https://example.com", "goal": "Click the next button and return the page text" }
- Response: Final page content after completing the goal
Health Check
- URL: `/health`
- Method: GET
API Documentation
Once the server is running, you can access the API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Logging
Server logs are written to `project_folder/logs/mcp.log` and include:
- Server startup/shutdown events
- Web scraping operations
- Error messages and exceptions
The `agents_stream_tools.py` script logs to `project_folder/logs/agent.log`. Both sets of logs are stored in the same directory, and the agent output captures tool calls and the `<think>` sections for full traceability. You can adjust server verbosity with the `--log-level` flag when starting the server; by default, server logs use the `warning` level.
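A file logger like the one described could be configured as below. The handler choice and message format are assumptions; only the `logs/mcp.log` path and the default `warning` level come from this README:

```python
import logging
from pathlib import Path

def setup_logger(log_dir: str = "logs", level: int = logging.WARNING) -> logging.Logger:
    """Create a logger writing to <log_dir>/mcp.log, mirroring the default warning level."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("mcp_server")
    logger.setLevel(level)
    handler = logging.FileHandler(Path(log_dir) / "mcp.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```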
Project Structure
- `mcp_server.py`: Core server implementation and MCP endpoints
- `agents_stream_tools.py`: Interactive agent for direct tool use
- `mcp.json`: MCP configuration file
- `requirements.txt`: Python dependencies
- `pyproject.toml`: Project metadata and build configuration
Error Handling
The server includes comprehensive error handling for:
- Invalid URLs
- Network connectivity issues
- WebDriver initialization failures
- Content extraction errors
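One way to shape those failures into a uniform response is sketched below; the `status`/`error` payload layout is an assumption, not the server's documented format:

```python
def safe_scrape(url: str, fetch) -> dict:
    """Run a fetch callable and wrap any failure in a structured error payload."""
    if not url.startswith(("http://", "https://")):
        return {"status": "error", "error": f"Invalid URL: {url}"}
    try:
        return {"status": "ok", "content": fetch(url)}
    except Exception as exc:  # network, WebDriver, or extraction failures
        return {"status": "error", "error": str(exc)}
```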
Contributing
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
Updating the mcp.json Path
If you move the project directory, run the helper script to update the `mcp.json` configuration:
python update_mcp_path.py
The updated JSON is copied to your clipboard so you can replace the content of `mcp.json` easily.