
MCP Web Content Extractor

A specialized Model Context Protocol (MCP) server built as a content-extraction component for research and data-gathering pipelines. It extracts clean text content from web pages and provides search capabilities, and it is designed to integrate with the other tools in a larger research workflow.

🔄 Workflow Integration

This tool serves as the content extraction layer in research pipelines:

Search Tools → URL Lists → THIS TOOL → Clean Content → Processing Tools

Typical Research Workflow Position:

  1. Search: Get URLs from search tools (Brave, DuckDuckGo, custom APIs)
  2. Extract: ← THIS TOOL - Clean text extraction from simple sites
  3. Process: Feed extracted content to processors, summarizers, chunkers
  4. Complement: Works alongside Puppeteer for complex sites and RSS feed processors for feeds

Input: Search queries or URL lists from upstream tools
Output: Clean, structured text content for downstream processing

Core Capabilities

🔍 Search Integration

  • Input: Search queries from workflow triggers
  • Engine: DuckDuckGo (no API keys required)
  • Output: Structured results (titles, URLs, snippets) for downstream tools
  • Use Case: Initial discovery phase in research workflows

🌐 Content Extraction (Primary Function)

  • Input: URL lists from search tools or manual specification
  • Processing: Clean text extraction optimized for simple sites
  • Output: Structured, clean content ready for processing tools
  • Batch Support: Efficient processing of multiple URLs
  • Memory Management: Handles large content volumes intelligently

🔧 Workflow Optimization

  • Rate Limiting: Respects server resources and avoids blocks (see the sketch after this list)
  • Error Handling: Graceful failures don't break pipeline execution
  • Content Filtering: Removes navigation, ads, scripts for clean data
  • Configurable Output: Adjustable content length for downstream tools
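
For illustration, the per-minute rate limiting configured via --search-rate-limit and --fetch-rate-limit can be pictured as a small sliding-window limiter like the TypeScript sketch below. This is a hypothetical sketch of the behaviour, not the server's actual implementation; the RateLimiter class and politeFetch helper are invented names.

// Illustrative sliding-window rate limiter (hypothetical; not the server's actual code).
// Allows at most `maxPerMinute` calls in any rolling 60-second window.
class RateLimiter {
  private timestamps: number[] = [];

  constructor(private maxPerMinute: number) {}

  // Resolves once a slot is free, so callers can simply `await limiter.acquire()`.
  async acquire(): Promise<void> {
    const windowMs = 60_000;
    for (;;) {
      const now = Date.now();
      // Drop timestamps that have aged out of the rolling window.
      this.timestamps = this.timestamps.filter((t) => now - t < windowMs);
      if (this.timestamps.length < this.maxPerMinute) {
        this.timestamps.push(now);
        return;
      }
      // Wait until the oldest request leaves the window, then re-check.
      const waitMs = windowMs - (now - this.timestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}

// Usage, mirroring the --fetch-rate-limit default of 20 requests per minute.
const fetchLimiter = new RateLimiter(20);
async function politeFetch(url: string): Promise<string> {
  await fetchLimiter.acquire();
  const res = await fetch(url);
  return res.text();
}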

🛡️ Production-Ready Features

  • Ethical Crawling: Respectful user agents and request patterns
  • Resource Management: Memory-efficient processing prevents crashes
  • Logging: Comprehensive monitoring for workflow debugging
  • TypeScript: Type safety for reliable integration

🛠️ Future Enhancements

  • Robots.txt Compliance: Respect crawl delays and disallow rules
  • Advanced Error Handling: Retry mechanisms for transient failures (see the sketch after this list)
  • Content Type Handling: Skip non-text content gracefully
  • Character Encoding: Proper handling for international content
  • Caching: Reduce redundant requests during development
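
These items are planned rather than implemented. As a rough idea of what the retry mechanism for transient failures could look like, here is a hedged TypeScript sketch; the fetchWithRetry helper, the status codes treated as transient, and the backoff schedule are all assumptions, not current behaviour.

// Hypothetical retry helper with exponential backoff (planned behaviour, not current code).
async function fetchWithRetry(
  url: string,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(30_000) });
      // Treat rate limiting and server errors as transient; fail fast on everything else.
      if (res.status === 429 || res.status >= 500) {
        throw new Error(`Transient HTTP ${res.status}`);
      }
      return await res.text();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Exponential backoff: 500 ms, 1 s, 2 s, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}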

Integration with MCP Clients

Claude Desktop

Option 1: NPX (Recommended)
{
  "mcpServers": {
    "web-crawler": {
      "command": "npx",
      "args": [
        "mcp-basic-web-crawler",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ],
      "env": {
        "MCP_BASIC_WEB_CRAWLER_USER_AGENT": "Basic Web Crawler/1.0"
      }
    }
  }
}
Option 2: Global Installation
{
  "mcpServers": {
    "web-crawler": {
      "command": "mcp-basic-web-crawler",
      "args": ["--log-level", "info"]
    }
  }
}
Option 3: Docker
{
  "mcpServers": {
    "web-crawler": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--security-opt", "no-new-privileges:true",
        "--memory", "512m",
        "--cpus", "0.5",
        "-e", "MCP_BASIC_WEB_CRAWLER_USER_AGENT=Basic Web Crawler/1.0",
        "calmren/mcp-basic-web-crawler:latest",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ]
    }
  }
}

Other MCP Clients

The server communicates via stdio and follows the MCP specification. It can be integrated with any MCP-compatible client.
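
As a hedged example of such an integration, the sketch below spawns the server over stdio with the official TypeScript SDK (@modelcontextprotocol/sdk) and calls the web_search tool. This example is not shipped with the project, and the SDK import paths and helper signatures may differ between SDK versions.

// Minimal MCP client sketch (assumes @modelcontextprotocol/sdk; APIs may vary by version).
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Launch the crawler over stdio, just as a desktop MCP client would.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["mcp-basic-web-crawler", "--log-level", "info"],
  });

  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  // The server should report its two tools: web_search and fetch_content.
  const { tools } = await client.listTools();
  console.log(tools.map((t) => t.name));

  // Call the discovery tool; its URLs can then be passed to fetch_content.
  const search = await client.callTool({
    name: "web_search",
    arguments: { query: "renewable energy storage solutions 2024", maxResults: 3 },
  });
  console.log(search.content);

  await client.close();
}

main().catch(console.error);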

Workflow Tools

This server provides two workflow-optimized tools:

1. web_search - Discovery Tool

Purpose: Generate URL lists for content extraction
Workflow Position: Step 1 (Search) → Feeds URLs to Step 2 (Extract)

Parameters:

  • query (string): Search query from workflow trigger
  • maxResults (number, optional): URL limit for downstream processing (default: 10)

Workflow Example:

{
  "query": "renewable energy storage solutions 2024",
  "maxResults": 8
}

Output: Structured list of URLs + metadata for fetch_content tool
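
The exact output schema is not spelled out here; based on the description above (titles, URLs, snippets), a downstream tool could model it roughly as in this TypeScript sketch. The field names are an assumption for illustration, not a published contract.

// Assumed shape of web_search results (illustrative only; field names are not from a spec).
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// Downstream step: collect just the URLs to hand to the fetch_content tool.
function toUrlList(results: SearchResult[], limit = 8): string[] {
  return results.slice(0, limit).map((r) => r.url);
}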

2. fetch_content - Extraction Tool

Purpose: Convert URLs to clean text content
Workflow Position: Step 2 (Extract) → Feeds content to processing tools

Parameters:

  • url (string | string[]): URLs from search results or manual input

Single URL (from search result):

{
  "url": "https://example.com/research-article"
}

Batch Processing (typical workflow):

{
  "url": [
    "https://site1.com/article",
    "https://site2.com/report",
    "https://site3.com/analysis"
  ]
}

Output: Clean text content ready for summarization, chunking, or analysis
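
As a hint of what a downstream step might do with this output, here is a hedged sketch of a naive fixed-size chunker. The chunk size and overlap are arbitrary illustrative choices, and none of this is part of the crawler itself.

// Naive downstream chunker for extracted text (illustrative; not part of this server).
function chunkText(text: string, chunkSize = 1000, overlap = 100): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Example: split one fetched article into ~1000-character chunks for a summarizer.
const chunks = chunkText("...clean text returned by fetch_content...");
console.log(`Produced ${chunks.length} chunk(s)`);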

Research Workflow Examples

Complete Research Pipeline

1. Search Tools (Brave API, DuckDuckGo)
   ↓ (URLs)
2. THIS TOOL (Simple site extraction)
   ↓ (Clean content)
3. Processing Tools (Summarizers, chunkers)
   ↓ (Structured data)
4. Analysis Tools (Aggregators, rankers)
   ↓ (Insights)
5. Output Tools (Report generators, artifacts)

Complementary Tool Integration

  • Simple Sites: This tool (fast, efficient)
  • Complex Sites: Puppeteer server (JavaScript rendering; see the routing sketch after this list)
  • Feeds: RSS feed processors
  • APIs: Custom search APIs
  • Processing: Content processors, summarizers
  • Storage: Vector databases, knowledge graphs
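
Below is a hedged sketch of how an orchestrating workflow might route URLs between this tool and a Puppeteer-based server. The heuristic and the domain list are placeholders invented for illustration; the project does not ship a router.

// Illustrative routing heuristic for an orchestrator (assumption, not a shipped feature).
type Extractor = "mcp-basic-web-crawler" | "puppeteer-server";

// Domains known to need JavaScript rendering go to Puppeteer; everything else goes here.
const JS_HEAVY_DOMAINS = new Set(["app.example.com", "spa.example.org"]);

function chooseExtractor(url: string): Extractor {
  const { hostname } = new URL(url);
  return JS_HEAVY_DOMAINS.has(hostname) ? "puppeteer-server" : "mcp-basic-web-crawler";
}

console.log(chooseExtractor("https://example.com/research-article")); // mcp-basic-web-crawler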

Typical Usage Pattern

# 1. Search for URLs
web_search("AI research 2024") → [url1, url2, url3...]

# 2. Extract content (this tool's primary function)
fetch_content([url1, url2, url3]) → clean_text_content

# 3. Process content (downstream tools)
process_content(clean_text_content) → structured_data

Installation

Method 1: NPX (Recommended)

# No installation needed - run directly with npx
npx mcp-basic-web-crawler --help

Method 2: Global Installation

npm install -g mcp-basic-web-crawler

Method 3: Docker

# Pull the image
docker pull calmren/mcp-basic-web-crawler:latest

# Or build locally
git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
docker build -t mcp-basic-web-crawler .

Method 4: From Source

git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
npm install
npm run build

Usage

NPX Usage (Recommended)

# Start the MCP server with npx
npx mcp-basic-web-crawler

# With custom configuration
npx mcp-basic-web-crawler --search-rate-limit 20 --log-level debug

Docker Usage

# Basic usage
docker run -p 3000:3000 calmren/mcp-basic-web-crawler

# With custom configuration
docker run -p 3000:3000 calmren/mcp-basic-web-crawler \
  --search-rate-limit 20 --log-level debug

# With environment variables
docker run -p 3000:3000 \
  -e MCP_WEB_CRAWLER_LOG_LEVEL=debug \
  -e MCP_WEB_CRAWLER_USER_AGENT="MyApp/1.0" \
  calmren/mcp-basic-web-crawler

Global Installation Usage

# If installed globally
mcp-basic-web-crawler --search-rate-limit 20 --log-level debug

Configuration Options

Option | Description | Default
--search-rate-limit <number> | Maximum search requests per minute | 30
--fetch-rate-limit <number> | Maximum fetch requests per minute | 20
--max-content-length <number> | Maximum content length to return | 8000
--timeout <number> | Request timeout in milliseconds | 30000
--user-agent <string> | Custom user agent string | Default MCP crawler UA
--log-level <level> | Log level (error, warn, info, debug) | info
--help, -h | Show help message | -

Environment Variables

Variable | Description
MCP_WEB_CRAWLER_LOG_LEVEL | Set log level
MCP_WEB_CRAWLER_USER_AGENT | Set custom user agent

License

This MCP server is licensed under the MIT License: you are free to use, modify, and distribute the software, subject to the license terms. For details, see the LICENSE file in the project repository.

Acknowledgments