
MCP Web Content Extractor

A specialized Model Context Protocol (MCP) server built as a content-extraction component for research and data-gathering pipelines. It extracts clean text content from web pages and provides search capabilities, and it is designed to integrate with the other tools in a larger research workflow.

🔄 Workflow Integration

This tool serves as the content extraction layer in research pipelines:

Search Tools → URL Lists → THIS TOOL → Clean Content → Processing Tools

Typical Research Workflow Position:

  1. Search: Get URLs from search tools (Brave, DuckDuckGo, custom APIs)
  2. Extract: ← THIS TOOL - Clean text extraction from simple sites
  3. Process: Feed extracted content to processors, summarizers, chunkers
  4. Complement: Works alongside Puppeteer for complex sites and RSS feed processors for feeds

Input: Search queries or URL lists from upstream tools
Output: Clean, structured text content for downstream processing

Core Capabilities

🔍 Search Integration

  • Input: Search queries from workflow triggers
  • Engine: DuckDuckGo (no API keys required)
  • Output: Structured results (titles, URLs, snippets) for downstream tools
  • Use Case: Initial discovery phase in research workflows

🌐 Content Extraction (Primary Function)

  • Input: URL lists from search tools or manual specification
  • Processing: Clean text extraction optimized for simple sites
  • Output: Structured, clean content ready for processing tools
  • Batch Support: Efficient processing of multiple URLs
  • Memory Management: Handles large content volumes intelligently

🔧 Workflow Optimization

  • Rate Limiting: Respects server resources and avoids blocks (see the sketch after this list)
  • Error Handling: Graceful failures don't break pipeline execution
  • Content Filtering: Removes navigation, ads, scripts for clean data
  • Configurable Output: Adjustable content length for downstream tools
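
For illustration, the per-minute rate limiting configured via --search-rate-limit and --fetch-rate-limit can be pictured as a small sliding-window limiter like the TypeScript sketch below. This is a hypothetical sketch of the behaviour, not the server's actual implementation; the RateLimiter class and politeFetch helper are invented names.

// Illustrative sliding-window rate limiter (hypothetical; not the server's actual code).
// Allows at most `maxPerMinute` calls in any rolling 60-second window.
class RateLimiter {
  private timestamps: number[] = [];

  constructor(private maxPerMinute: number) {}

  // Resolves once a slot is free, so callers can simply `await limiter.acquire()`.
  async acquire(): Promise<void> {
    const windowMs = 60_000;
    for (;;) {
      const now = Date.now();
      // Drop timestamps that have aged out of the rolling window.
      this.timestamps = this.timestamps.filter((t) => now - t < windowMs);
      if (this.timestamps.length < this.maxPerMinute) {
        this.timestamps.push(now);
        return;
      }
      // Wait until the oldest request leaves the window, then re-check.
      const waitMs = windowMs - (now - this.timestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}

// Usage, mirroring the --fetch-rate-limit default of 20 requests per minute.
const fetchLimiter = new RateLimiter(20);
async function politeFetch(url: string): Promise<string> {
  await fetchLimiter.acquire();
  const res = await fetch(url);
  return res.text();
}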

🛡️ Production-Ready Features

  • Ethical Crawling: Respectful user agents and request patterns
  • Resource Management: Memory-efficient processing prevents crashes
  • Logging: Comprehensive monitoring for workflow debugging
  • TypeScript: Type safety for reliable integration

🛠️ Future Enhancements

  • Robots.txt Compliance: Respect crawl delays and disallow rules
  • Advanced Error Handling: Retry mechanisms for transient failures (see the sketch after this list)
  • Content Type Handling: Skip non-text content gracefully
  • Character Encoding: Proper handling for international content
  • Caching: Reduce redundant requests during development
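
These items are planned rather than implemented. As a rough idea of what the retry mechanism for transient failures could look like, here is a hedged TypeScript sketch; the fetchWithRetry helper, the status codes treated as transient, and the backoff schedule are all assumptions, not current behaviour.

// Hypothetical retry helper with exponential backoff (planned behaviour, not current code).
async function fetchWithRetry(
  url: string,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(30_000) });
      // Treat rate limiting and server errors as transient; fail fast on everything else.
      if (res.status === 429 || res.status >= 500) {
        throw new Error(`Transient HTTP ${res.status}`);
      }
      return await res.text();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Exponential backoff: 500 ms, 1 s, 2 s, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}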

Integration with MCP Clients

Claude Desktop

Option 1: NPX (Recommended)
{
  "mcpServers": {
    "web-crawler": {
      "command": "npx",
      "args": [
        "mcp-basic-web-crawler",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ],
      "env": {
        "MCP_BASIC_WEB_CRAWLER_USER_AGENT": "Basic Web Crawler/1.0"
      }
    }
  }
}
Option 2: Global Installation
{
  "mcpServers": {
    "web-crawler": {
      "command": "mcp-basic-web-crawler",
      "args": ["--log-level", "info"]
    }
  }
}
Option 3: Docker
{
  "mcpServers": {
    "web-crawler": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--security-opt", "no-new-privileges:true",
        "--memory", "512m",
        "--cpus", "0.5",
        "-e", "MCP_BASIC_WEB_CRAWLER_USER_AGENT=Basic Web Crawler/1.0",
        "calmren/mcp-basic-web-crawler:latest",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ]
    }
  }
}

Other MCP Clients

The server communicates via stdio and follows the MCP specification. It can be integrated with any MCP-compatible client.
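
As a hedged example of such an integration, the sketch below spawns the server over stdio with the official TypeScript SDK (@modelcontextprotocol/sdk) and calls the web_search tool. This example is not shipped with the project, and the SDK import paths and helper signatures may differ between SDK versions.

// Minimal MCP client sketch (assumes @modelcontextprotocol/sdk; APIs may vary by version).
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Launch the crawler over stdio, just as a desktop MCP client would.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["mcp-basic-web-crawler", "--log-level", "info"],
  });

  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  // The server should report its two tools: web_search and fetch_content.
  const { tools } = await client.listTools();
  console.log(tools.map((t) => t.name));

  // Call the discovery tool; its URLs can then be passed to fetch_content.
  const search = await client.callTool({
    name: "web_search",
    arguments: { query: "renewable energy storage solutions 2024", maxResults: 3 },
  });
  console.log(search.content);

  await client.close();
}

main().catch(console.error);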

Workflow Tools

This server provides two workflow-optimized tools:

1. web_search - Discovery Tool

Purpose: Generate URL lists for content extraction
Workflow Position: Step 1 (Search) → Feeds URLs to Step 2 (Extract)

Parameters:

  • query (string): Search query from workflow trigger
  • maxResults (number, optional): URL limit for downstream processing (default: 10)

Workflow Example:

{
  "query": "renewable energy storage solutions 2024",
  "maxResults": 8
}

Output: Structured list of URLs + metadata for fetch_content tool
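
The exact output schema is not spelled out here; based on the description above (titles, URLs, snippets), a downstream tool could model it roughly as in this TypeScript sketch. The field names are an assumption for illustration, not a published contract.

// Assumed shape of web_search results (illustrative only; field names are not from a spec).
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// Downstream step: collect just the URLs to hand to the fetch_content tool.
function toUrlList(results: SearchResult[], limit = 8): string[] {
  return results.slice(0, limit).map((r) => r.url);
}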

2. fetch_content - Extraction Tool

Purpose: Convert URLs to clean text content
Workflow Position: Step 2 (Extract) → Feeds content to processing tools

Parameters:

  • url (string | string[]): URLs from search results or manual input

Single URL (from search result):

{
  "url": "https://example.com/research-article"
}

Batch Processing (typical workflow):

{
  "url": [
    "https://site1.com/article",
    "https://site2.com/report",
    "https://site3.com/analysis"
  ]
}

Output: Clean text content ready for summarization, chunking, or analysis
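
As a hint of what a downstream step might do with this output, here is a hedged sketch of a naive fixed-size chunker. The chunk size and overlap are arbitrary illustrative choices, and none of this is part of the crawler itself.

// Naive downstream chunker for extracted text (illustrative; not part of this server).
function chunkText(text: string, chunkSize = 1000, overlap = 100): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Example: split one fetched article into ~1000-character chunks for a summarizer.
const chunks = chunkText("...clean text returned by fetch_content...");
console.log(`Produced ${chunks.length} chunk(s)`);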

Research Workflow Examples

Complete Research Pipeline

1. Search Tools (Brave API, DuckDuckGo)
   ↓ (URLs)
2. THIS TOOL (Simple site extraction)
   ↓ (Clean content)
3. Processing Tools (Summarizers, chunkers)
   ↓ (Structured data)
4. Analysis Tools (Aggregators, rankers)
   ↓ (Insights)
5. Output Tools (Report generators, artifacts)

Complementary Tool Integration

  • Simple Sites: This tool (fast, efficient)
  • Complex Sites: Puppeteer server (JavaScript rendering; see the routing sketch after this list)
  • Feeds: RSS feed processors
  • APIs: Custom search APIs
  • Processing: Content processors, summarizers
  • Storage: Vector databases, knowledge graphs
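
Below is a hedged sketch of how an orchestrating workflow might route URLs between this tool and a Puppeteer-based server. The heuristic and the domain list are placeholders invented for illustration; the project does not ship a router.

// Illustrative routing heuristic for an orchestrator (assumption, not a shipped feature).
type Extractor = "mcp-basic-web-crawler" | "puppeteer-server";

// Domains known to need JavaScript rendering go to Puppeteer; everything else goes here.
const JS_HEAVY_DOMAINS = new Set(["app.example.com", "spa.example.org"]);

function chooseExtractor(url: string): Extractor {
  const { hostname } = new URL(url);
  return JS_HEAVY_DOMAINS.has(hostname) ? "puppeteer-server" : "mcp-basic-web-crawler";
}

console.log(chooseExtractor("https://example.com/research-article")); // mcp-basic-web-crawler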

Typical Usage Pattern

# 1. Search for URLs
web_search("AI research 2024") → [url1, url2, url3...]

# 2. Extract content (this tool's primary function)
fetch_content([url1, url2, url3]) → clean_text_content

# 3. Process content (downstream tools)
process_content(clean_text_content) → structured_data

Installation

Method 1: NPX (Recommended)

# No installation needed - run directly with npx
npx mcp-basic-web-crawler --help

Method 2: Global Installation

npm install -g mcp-basic-web-crawler

Method 3: Docker

# Pull the image
docker pull calmren/mcp-basic-web-crawler:latest

# Or build locally
git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
docker build -t mcp-basic-web-crawler .

Method 4: From Source

git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
npm install
npm run build

Usage

NPX Usage (Recommended)

# Start the MCP server with npx
npx mcp-basic-web-crawler

# With custom configuration
npx mcp-basic-web-crawler --search-rate-limit 20 --log-level debug

Docker Usage

# Basic usage
docker run -p 3000:3000 calmren/mcp-basic-web-crawler

# With custom configuration
docker run -p 3000:3000 calmren/mcp-basic-web-crawler \
  --search-rate-limit 20 --log-level debug

# With environment variables
docker run -p 3000:3000 \
  -e MCP_WEB_CRAWLER_LOG_LEVEL=debug \
  -e MCP_WEB_CRAWLER_USER_AGENT="MyApp/1.0" \
  calmren/mcp-basic-web-crawler

Global Installation Usage

# If installed globally
mcp-basic-web-crawler --search-rate-limit 20 --log-level debug

Configuration Options

Option | Description | Default
--search-rate-limit <number> | Maximum search requests per minute | 30
--fetch-rate-limit <number> | Maximum fetch requests per minute | 20
--max-content-length <number> | Maximum content length to return | 8000
--timeout <number> | Request timeout in milliseconds | 30000
--user-agent <string> | Custom user agent string | Default MCP crawler UA
--log-level <level> | Log level (error, warn, info, debug) | info
--help, -h | Show help message | -

Environment Variables

Variable | Description
MCP_WEB_CRAWLER_LOG_LEVEL | Set log level
MCP_WEB_CRAWLER_USER_AGENT | Set custom user agent

License

This MCP server is licensed under the MIT License: you are free to use, modify, and distribute the software, subject to the license terms. For details, see the LICENSE file in the project repository.

Acknowledgments