pymcp

A lightweight, Python-based MCP server for LM Studio, offering web search, content extraction, spidering, and file reading.

Python MCP Web Search and Spider Tool

This Python script (pymcp.py) is an MCP (Model Context Protocol) server for web searching, website spidering, and local file reading. It exposes the tools full-web-search, get-web-search-summaries, get-single-web-page-content, fetch_url_raw, spider_website, and read_local_file. It uses Selenium with webdriver-manager for robust web scraping on Ubuntu Linux, helping to bypass anti-bot measures, with a fallback to urllib for basic requests.
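
In rough outline, the fetch path is a headless-Chrome request provisioned by webdriver-manager, with a plain urllib fallback when the browser path fails. The sketch below is illustrative only; the function name and Chrome options are assumptions, not the script's actual code:

from urllib.request import Request, urlopen

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def fetch_page(url, timeout=15):
    """Fetch a page with headless Chrome; fall back to urllib if Selenium fails."""
    try:
        options = Options()
        options.add_argument("--headless=new")        # no display needed
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()),  # auto-managed ChromeDriver
            options=options,
        )
        try:
            driver.set_page_load_timeout(timeout)
            driver.get(url)
            return driver.page_source                 # fully rendered HTML
        finally:
            driver.quit()
    except Exception:
        # Fallback: plain HTTP GET, no JavaScript rendering
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")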

Features

  • Web Search: Query Bing, Google, or DuckDuckGo for search results with titles, URLs, and snippets (full-web-search, get-web-search-summaries).
  • Website Spidering: Crawl a website up to a specified depth, collecting page content (spider_website).
  • Single Page Fetch: Retrieve content from a specific URL (get-single-web-page-content, fetch_url_raw).
  • Local File Access: Read local files (read_local_file).
  • Debugging: Extensive logging to stderr and debug files (debug_search.html) for troubleshooting.
  • Optimized for Speed: Reduced timeouts and page limits to avoid MCP timeouts (e.g., spider_website capped at 10 pages, 30s).

Requirements

  • Python: 3.6+ (tested on 3.10+).
  • Ubuntu Linux Setup (as of October 11, 2025):
    1. Install Google Chrome:
      wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
      sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
      sudo apt update
      sudo apt install google-chrome-stable
      
      Verify: google-chrome --version (e.g., "Google Chrome 120.0.6099.71").
    2. Install Python dependencies:
      pip install selenium webdriver-manager
      
    3. No manual ChromeDriver download is needed; webdriver-manager handles it automatically (a one-line check is shown after this list).
  • Optional manual ChromeDriver: if webdriver-manager fails, download the ChromeDriver release matching your installed Chrome version and place the binary on your PATH.
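
A quick way to confirm that webdriver-manager can provision a matching driver (assuming the packages above installed cleanly) is to run the following in a Python shell:

from webdriver_manager.chrome import ChromeDriverManager
print(ChromeDriverManager().install())  # prints the path to the downloaded ChromeDriver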

Installation

  1. Save the script as pymcp.py.
  2. Ensure Chrome and dependencies are installed (see above).
  3. Verify network access (for web requests) and file write permissions (for debug logs).

Usage

Terminal (Standalone)

Run the script as an MCP server, piping JSON-RPC commands:

echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"spider_website","arguments":{"url":"https://lampdatabase.com","max_depth":2}}}' | python pymcp.py

Example for search:

echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"full-web-search","arguments":{"query":"rolf schatzmann","limit":5}}}' | python pymcp.py

Output: JSON-RPC response with results (e.g., pages for spider_website or search results with URLs/snippets).
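
The same exchange can be driven from Python instead of a shell pipe. This is a minimal sketch assuming the one-request-per-invocation, line-delimited JSON-RPC framing that the echo examples above rely on:

import json
import subprocess

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get-web-search-summaries",
        "arguments": {"query": "rolf schatzmann", "limit": 5},
    },
}

# Pipe one JSON-RPC request into the server and capture its output, as the echo examples do.
proc = subprocess.run(
    ["python", "pymcp.py"],
    input=json.dumps(request) + "\n",
    capture_output=True,
    text=True,
    timeout=120,
)
print(proc.stdout)        # JSON-RPC response with the tool result
print(proc.stderr[:500])  # debug logging goes to stderr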

LM Studio

  1. Load pymcp.py as an MCP server in LM Studio (Tools > Custom Scripts or equivalent; a sample configuration sketch follows this list).
  2. Ensure "Network Access" is enabled in LM Studio settings.
  3. Call tools via the interface or API, e.g., spider_website({"url":"https://lampdatabase.com","max_depth":2}).
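
Some LM Studio builds register MCP servers through an mcp.json file rather than a script-loading dialog. If that is what your version uses, an entry along these lines is the usual shape for a stdio server (the path and Python command are placeholders):

{
  "mcpServers": {
    "pymcp": {
      "command": "python",
      "args": ["/path/to/pymcp.py"]
    }
  }
}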

Tools

  • full-web-search: Search with full page content (query, limit=1-10, includeContent=true).
  • get-web-search-summaries: Search with only titles/URLs/snippets (query, limit=1-10).
  • get-single-web-page-content: Fetch one URL’s content (url, maxContentLength=5000).
  • fetch_url_raw: Fetch raw HTML (url).
  • spider_website: Crawl a site up to depth 2 (url, max_depth=2, max 10 pages).
  • read_local_file: Read a local file (path).

Expected Output for spider_website

For spider_website({"url":"https://lampdatabase.com","max_depth":2}):

{
  "pages": {
    "https://lampdatabase.com/": "<html content truncated... (homepage HTML)>",
    "https://lampdatabase.com/contact.php": "<html content truncated... (contact form HTML)>"
  }
}
  • Completes in ~10-15s (2 pages, Selenium fetch).
  • Debug logs in debug_search.html and stderr.

Troubleshooting

  • "Selenium failed": Check Chrome installation (google-chrome --version) and pip install selenium webdriver-manager. Ensure ChromeDriver matches Chrome version.
  • Timeout (-32001): Increase LM Studio’s MCP timeout or reduce max_depth=1. Check stderr for "Crawled X/Y pages".
  • Empty Results: Inspect debug_search.html for HTML content. If minimal (e.g., just ""), anti-bot measures are active; Selenium should resolve this.
  • LM Studio Sandbox: Ensure "Allow subprocesses" and "Network Access" are enabled in settings.
  • Logs: Check stderr for "Debug: Crawling...", "Selenium fetched...", or errors.
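
If Selenium itself seems to be the problem, a short standalone smoke test (independent of pymcp.py) usually isolates it:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://example.com")
print(driver.title)   # expect "Example Domain" if Chrome and ChromeDriver are working
driver.quit()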

Notes

  • Selenium uses headless Chrome for full page rendering, bypassing anti-bot measures (e.g., Bing’s block pages).
  • spider_website is capped at 10 pages and 30s to prevent timeouts (a simplified sketch of this cap follows this list).
  • Debug files (debug_search.html) are written to the script directory for inspection.
  • If issues persist, share stderr logs or debug_search.html contents (first 500 chars).
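
The caps above amount to a bounded breadth-first crawl. This is a simplified sketch of the idea, not the script's exact code: fetch_page stands in for the Selenium fetch shown earlier, and the link extraction is deliberately naive.

import re
import time
from urllib.parse import urljoin, urlparse

MAX_PAGES = 10        # hard cap on pages per crawl
MAX_SPIDER_TIME = 30  # wall-clock budget in seconds for the whole crawl

def spider_website(start_url, fetch_page, max_depth=2):
    """Breadth-first crawl limited by depth, page count, and elapsed time."""
    start = time.time()
    base_host = urlparse(start_url).netloc
    queue = [(start_url, 0)]
    pages, seen = {}, {start_url}
    while queue and len(pages) < MAX_PAGES and time.time() - start < MAX_SPIDER_TIME:
        url, depth = queue.pop(0)
        html = fetch_page(url)                      # Selenium fetch in the real script
        pages[url] = html
        if depth < max_depth:
            for href in re.findall(r'href=["\'](.*?)["\']', html):
                link = urljoin(url, href)
                # Stay on the same host and skip already-queued pages
                if urlparse(link).netloc == base_host and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return {"pages": pages}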
