Web Scraper MCP Server
A web scraping server that implements the Model Context Protocol (MCP) using FastAPI. This server provides tools for extracting content and links from web pages in a structured way.
Features
- Web content extraction with intelligent content cleaning
- Link extraction with full URL resolution
- Language detection
- Headless browser automation with Playwright by default or Selenium when cookies are required
- Open URLs in your existing browser session
- Multi-step website actions with Playwright
- FastAPI-based REST API
- MCP protocol implementation
- Streaming agent uses a planner, a per-step executor agent, and a summarizer
- Planner outputs `<plan>` with a list of tool names outside of `<think>`
Prerequisites
- Python 3.11+
- Chrome browser installed
- If Chrome is not in your PATH, set the `CHROME_BINARY` environment variable to the full path of the Chrome executable
- The WebScraper can run in Playwright or Selenium mode. Playwright is the default unless cookie-based sessions are needed.
- uv package manager
Setup
- Create a virtual environment and install dependencies using uv:
uv venv
uv pip install -r requirements.txt
- Create a `.env` file (optional):
PORT=8000
Usage modes
MCP mode
Start the server:
python mcp_server.py [--log-level info|debug|warning|error|critical]
The server will start on http://localhost:8000 by default.
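Once it is up, a quick way to confirm the server responds is to hit the `/health` endpoint documented below; a minimal sketch using the `requests` library:

```python
import requests

# Sanity check: the /health endpoint should answer once the server is up.
resp = requests.get("http://localhost:8000/health")
print(resp.status_code, resp.text)
```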
Agent mode
The old `agents.py` script is deprecated; legacy code is kept in `agents_legacy.py` for reference. Use `agents_stream_tools.py` for models that support the tools API. For other models, run `agents_stream_prompt.py`.
Run the interactive agent with tool support:
python agents_stream_tools.py [--debug] "your question here"
Use the --debug flag to print tool calls and intermediate messages.
For models without tool support:
python agents_stream_prompt.py "your question here"
Both agents follow a three-part workflow:
- Planner decides which tools to call and returns `<plan>` containing only tool names, outside of `<think>` (see the sketch after this list).
- Executor runs each step using the previous tool output as context.
- Summarizer uses the final tool output to answer the query.
API Endpoints
MCP Endpoints
Scrape Website
- URL: `/mcp`
- Method: POST
- Command: `scrape_website`
- Parameters:
{ "url": "https://example.com", "query": "specific topic" }
- Response: Filtered content relevant to the query (see the example request below)
Extract Links
- URL: `/mcp`
- Method: POST
- Command: `extract_links`
- Parameters:
{ "url": "https://example.com" }
- Response: List of all links found on the page; relative paths are converted to absolute URLs based on the page they were found on, as the sketch below illustrates
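The resolution behaves like Python's `urllib.parse.urljoin`; a minimal sketch of the same idea (not the server's actual code):

```python
from urllib.parse import urljoin

# Relative hrefs are resolved against the page they were found on.
page_url = "https://example.com/docs/index.html"
print(urljoin(page_url, "guide.html"))   # https://example.com/docs/guide.html
print(urljoin(page_url, "../about"))     # https://example.com/about
print(urljoin(page_url, "https://other.example/x"))  # absolute URLs pass through
```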
Download PDFs
- URL: `/mcp`
- Method: POST
- Command: `download_pdfs`
- Parameters:
{ "links": ["https://example.com/sample.pdf"] }
- Response: Paths to the downloaded PDF files
Open Browser
- URL: `/mcp`
- Method: POST
- Command: `open_browser`
- Parameters:
{ "url": "https://example.com" }
- Response: Confirmation that the URL was opened using your browser session
React Browser Task
- URL: `/mcp`
- Method: POST
- Command: `react_browser_task`
- Parameters:
{ "url": "https://example.com", "goal": "Click the next button and return the page text" }
- Response: Final page content after completing the goal
Health Check
- URL: `/health`
- Method: GET
API Documentation
Once the server is running, you can access the API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Logging
Server logs are written to `project_folder/logs/mcp.log` and include:
- Server startup/shutdown events
- Web scraping operations
- Error messages and exceptions
The `agents_stream_tools.py` and `agents_stream_prompt.py` scripts log to `project_folder/logs/agent.log`. Both sets of logs are stored in the same directory, and the agent output captures tool calls and the `<think>` sections for full traceability. You can adjust server verbosity with the `--log-level` flag when starting the server; by default, server logs use the `warning` level.
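For reference, a minimal sketch of the kind of file logging described here (illustrative only, not the server's actual setup):

```python
import logging
from pathlib import Path

# Write logs under <project>/logs, creating the directory if needed.
log_dir = Path(__file__).resolve().parent / "logs"
log_dir.mkdir(exist_ok=True)

logging.basicConfig(
    filename=log_dir / "mcp.log",
    level=logging.WARNING,  # the documented default --log-level
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).warning("server starting")
```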
Project Structure
- `mcp_server.py`: Core server implementation and MCP endpoints
- `agents_stream_tools.py`: Interactive agent for direct tool use
- `agents_stream_prompt.py`: Agent for models without Ollama tools support
- `mcp.json`: MCP configuration file
- `requirements.txt`: Python dependencies
- `pyproject.toml`: Project metadata and build configuration
Error Handling
The server includes comprehensive error handling for:
- Invalid URLs
- Network connectivity issues
- WebDriver initialization failures
- Content extraction errors
Contributing
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
Updating mcp.json Path
If you move the project directory, run the helper script to update the `mcp.json` configuration:
python update_mcp_path.py
The updated JSON is copied to your clipboard so you can replace the content of `mcp.json` easily.
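For illustration, a sketch of what such a helper might do. The old-path placeholder and the use of the `pyperclip` package are assumptions; `update_mcp_path.py` is the authoritative version:

```python
from pathlib import Path

import pyperclip  # assumption: clipboard handling via the pyperclip package

# Swap the old project path (hypothetical placeholder) for the directory
# this script now lives in, then copy the result to the clipboard.
project_dir = Path(__file__).resolve().parent
config_text = (project_dir / "mcp.json").read_text()
updated = config_text.replace("/old/project/path", str(project_dir))

pyperclip.copy(updated)  # paste this back into mcp.json
print("Updated mcp.json copied to clipboard.")
```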