MCP Web Content Extractor
A specialized Model Context Protocol (MCP) server designed as a workflow component for research and data-gathering pipelines. It extracts clean text content from web pages and provides search capabilities, and is built specifically to integrate with other tools in comprehensive research workflows.
🔄 Workflow Integration
This tool serves as the content extraction layer in research pipelines:
Search Tools → URL Lists → THIS TOOL → Clean Content → Processing Tools
Typical Research Workflow Position:
- Search: Get URLs from search tools (Brave, DuckDuckGo, custom APIs)
- Extract: ← THIS TOOL - Clean text extraction from simple sites
- Process: Feed extracted content to processors, summarizers, chunkers
- Complement: Works alongside the Puppeteer server for complex sites and RSS processors for feeds
Input: Search queries or URL lists from upstream tools
Output: Clean, structured text content for downstream processing
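The shapes of these inputs and outputs can be pictured with the TypeScript interfaces below. They are illustrative only; the field names are assumptions based on the descriptions in this README, not the server's published types.

// Illustrative shapes only; field names are assumptions, not the server's actual types.

// What web_search hands downstream: titles, URLs, and snippets.
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// What fetch_content hands to processing tools: clean text keyed by source URL.
interface ExtractedContent {
  url: string;
  text: string;
}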
Core Capabilities
🔍 Search Integration
- Input: Search queries from workflow triggers
- Engine: DuckDuckGo (no API keys required)
- Output: Structured results (titles, URLs, snippets) for downstream tools
- Use Case: Initial discovery phase in research workflows
🌐 Content Extraction (Primary Function)
- Input: URL lists from search tools or manual specification
- Processing: Clean text extraction optimized for simple sites
- Output: Structured, clean content ready for processing tools
- Batch Support: Efficient processing of multiple URLs
- Memory Management: Handles large content volumes intelligently
🔧 Workflow Optimization
- Rate Limiting: Respects server resources and avoids blocks
- Error Handling: Graceful failures don't break pipeline execution
- Content Filtering: Removes navigation, ads, and scripts for clean data (see the sketch after this list)
- Configurable Output: Adjustable content length for downstream tools
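The project uses Cheerio for HTML parsing (see Acknowledgments). The snippet below is a minimal sketch of the kind of content filtering described above, assuming a Cheerio-based pipeline; the selector list and the extractCleanText helper are illustrative, not the server's actual code.

import * as cheerio from "cheerio";

// Minimal sketch of clean-text extraction; selectors and helper name are illustrative.
function extractCleanText(html: string, maxLength = 8000): string {
  const $ = cheerio.load(html);

  // Drop non-content elements before reading text.
  $("script, style, nav, header, footer, aside, iframe, noscript").remove();

  // Collapse whitespace and trim to the configured maximum content length.
  const text = $("body").text().replace(/\s+/g, " ").trim();
  return text.slice(0, maxLength);
}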
🛡️ Production-Ready Features
- Ethical Crawling: Respectful user agents and request patterns
- Resource Management: Memory-efficient processing prevents crashes
- Logging: Comprehensive monitoring for workflow debugging
- TypeScript: Type safety for reliable integration
🛠️ Future Enhancements
- Robots.txt Compliance: Respect crawl delays and disallow rules
- Advanced Error Handling: Retry mechanisms for transient failures
- Content Type Handling: Skip non-text content gracefully
- Character Encoding: Proper handling for international content
- Caching: Reduce redundant requests during development
Integration with MCP Clients
Claude Desktop
Option 1: NPX (Recommended)
{
  "mcpServers": {
    "web-crawler": {
      "command": "npx",
      "args": [
        "mcp-basic-web-crawler",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ],
      "env": {
        "MCP_BASIC_WEB_CRAWLER_USER_AGENT": "Basic Web Crawler/1.0"
      }
    }
  }
}
Option 2: Global Installation
{
  "mcpServers": {
    "web-crawler": {
      "command": "mcp-basic-web-crawler",
      "args": ["--log-level", "info"]
    }
  }
}
Option 3: Docker
{
  "mcpServers": {
    "web-crawler": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--security-opt", "no-new-privileges:true",
        "--memory", "512m",
        "--cpus", "0.5",
        "-e", "MCP_BASIC_WEB_CRAWLER_USER_AGENT=Basic Web Crawler/1.0",
        "calmren/mcp-basic-web-crawler:latest",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ]
    }
  }
}
Other MCP Clients
The server communicates via stdio and follows the MCP specification. It can be integrated with any MCP-compatible client.
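For example, a client built on the official TypeScript SDK can launch the server over stdio and discover its tools roughly as follows. This is a sketch assuming the @modelcontextprotocol/sdk client API; adapt it to your client framework.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the crawler via npx and communicate over stdio.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["mcp-basic-web-crawler", "--log-level", "info"],
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// The server advertises its two tools: web_search and fetch_content.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));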
Workflow Tools
This server provides two workflow-optimized tools:
1. web_search - Discovery Tool
Purpose: Generate URL lists for content extraction
Workflow Position: Step 1 (Search) → Feeds URLs to Step 2 (Extract)
Parameters:
- query (string): Search query from workflow trigger
- maxResults (number, optional): URL limit for downstream processing (default: 10)
Workflow Example:
{
  "query": "renewable energy storage solutions 2024",
  "maxResults": 8
}
Output: Structured list of URLs + metadata for the fetch_content tool
2. fetch_content - Extraction Tool
Purpose: Convert URLs to clean text content
Workflow Position: Step 2 (Extract) → Feeds content to processing tools
Parameters:
- url (string | string[]): URLs from search results or manual input
Single URL (from search result):
{
  "url": "https://example.com/research-article"
}
Batch Processing (typical workflow):
{
  "url": [
    "https://site1.com/article",
    "https://site2.com/report",
    "https://site3.com/analysis"
  ]
}
Output: Clean text content ready for summarization, chunking, or analysis
Research Workflow Examples
Complete Research Pipeline
1. Search Tools (Brave API, DuckDuckGo)
↓ (URLs)
2. THIS TOOL (Simple site extraction)
↓ (Clean content)
3. Processing Tools (Summarizers, chunkers)
↓ (Structured data)
4. Analysis Tools (Aggregators, rankers)
↓ (Insights)
5. Output Tools (Report generators, artifacts)
Complementary Tool Integration
- Simple Sites: This tool (fast, efficient)
- Complex Sites: Puppeteer server (JavaScript rendering)
- Feeds: RSS feed processors
- APIs: Custom search APIs
- Processing: Content processors, summarizers
- Storage: Vector databases, knowledge graphs
Typical Usage Pattern
# 1. Search for URLs
web_search("AI research 2024") → [url1, url2, url3...]
# 2. Extract content (this tool's primary function)
fetch_content([url1, url2, url3]) → clean_text_content
# 3. Process content (downstream tools)
process_content(clean_text_content) → structured_data
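In an MCP client, the same pattern maps onto two callTool invocations. The sketch below assumes the TypeScript SDK client shown earlier; how the URL list is parsed out of the web_search result is an assumption, since tool output arrives as MCP content blocks.

// 1. Search for URLs.
const searchResult = await client.callTool({
  name: "web_search",
  arguments: { query: "AI research 2024", maxResults: 5 },
});

// Parse URLs out of searchResult's content blocks (parsing details depend on the
// server's output format; the list below is a placeholder).
const urls = ["https://example.com/article-1", "https://example.com/article-2"];

// 2. Extract content (this tool's primary function).
const contentResult = await client.callTool({
  name: "fetch_content",
  arguments: { url: urls },
});

// 3. Hand the extracted text to downstream processing tools.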
Installation
Method 1: NPX (Recommended)
# No installation needed - run directly with npx
npx mcp-basic-web-crawler --help
Method 2: Global Installation
npm install -g mcp-basic-web-crawler
Method 3: Docker
# Pull the image
docker pull calmren/mcp-basic-web-crawler:latest
# Or build locally
git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
docker build -t mcp-basic-web-crawler .
Method 4: From Source
git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
npm install
npm run build
Usage
NPX Usage (Recommended)
# Start the MCP server with npx
npx mcp-basic-web-crawler
# With custom configuration
npx mcp-basic-web-crawler --search-rate-limit 20 --log-level debug
Docker Usage
# Basic usage
docker run -p 3000:3000 calmren/mcp-basic-web-crawler
# With custom configuration
docker run -p 3000:3000 calmren/mcp-basic-web-crawler \
--search-rate-limit 20 --log-level debug
# With environment variables
docker run -p 3000:3000 \
-e MCP_WEB_CRAWLER_LOG_LEVEL=debug \
-e MCP_WEB_CRAWLER_USER_AGENT="MyApp/1.0" \
calmren/mcp-basic-web-crawler
Global Installation Usage
# If installed globally
mcp-basic-web-crawler --search-rate-limit 20 --log-level debug
Configuration Options
| Option | Description | Default |
| --- | --- | --- |
| --search-rate-limit <number> | Maximum search requests per minute | 30 |
| --fetch-rate-limit <number> | Maximum fetch requests per minute | 20 |
| --max-content-length <number> | Maximum content length to return | 8000 |
| --timeout <number> | Request timeout in milliseconds | 30000 |
| --user-agent <string> | Custom user agent string | Default MCP crawler UA |
| --log-level <level> | Log level (error, warn, info, debug) | info |
| --help, -h | Show help message | - |
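Both rate limits are expressed in requests per minute. The sketch below illustrates one way such a limit can be enforced (a sliding one-minute window); it is illustrative only, not the server's actual implementation.

// Illustrative sliding-window limiter for "N requests per minute".
class PerMinuteRateLimiter {
  private timestamps: number[] = [];

  constructor(private maxPerMinute: number) {}

  // Resolves once a request slot is free within the last minute.
  async acquire(): Promise<void> {
    for (;;) {
      const now = Date.now();
      this.timestamps = this.timestamps.filter((t) => now - t < 60_000);
      if (this.timestamps.length < this.maxPerMinute) {
        this.timestamps.push(now);
        return;
      }
      // Wait until the oldest request falls out of the one-minute window.
      await new Promise((r) => setTimeout(r, this.timestamps[0] + 60_000 - now));
    }
  }
}

// Example: mirror --fetch-rate-limit 20
const fetchLimiter = new PerMinuteRateLimiter(20);
await fetchLimiter.acquire(); // before each page fetch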
Environment Variables
| Variable | Description |
| --- | --- |
| MCP_WEB_CRAWLER_LOG_LEVEL | Set log level |
| MCP_WEB_CRAWLER_USER_AGENT | Set custom user agent |
License
This MCP server is licensed under the MIT License. This means you are free to use, modify, and distribute the software, subject to the terms and conditions of the MIT License. For more details, please see the LICENSE file in the project repository.
Acknowledgments
- Built on the Model Context Protocol
- Uses DuckDuckGo for search functionality
- Powered by Cheerio for HTML parsing