# webcrawl-mcp


๐Ÿ† Webcrawl MCP Server - Production Ready Implementation ๐Ÿ†

```
███╗   ███╗ ██████╗██████╗     ██████╗ ███████╗███████╗
████╗ ████║██╔════╝██╔══██╗    ██╔══██╗██╔════╝██╔════╝
██╔████╔██║██║     ██████╔╝    ██████╔╝█████╗  █████╗
██║╚██╔╝██║██║     ██╔═══╝     ██╔══██╗██╔══╝  ██╔══╝
██║ ╚═╝ ██║╚██████╗██║         ██║  ██║███████╗██║
╚═╝     ╚═╝ ╚═════╝╚═╝         ╚═╝  ╚═╝╚══════╝╚═╝

   100% MCP COMPLIANT | ABORT CAPABLE ✅
```

A production-ready Model Context Protocol (MCP) server providing comprehensive web crawling capabilities. Features intelligent content extraction, tool execution abort functionality, and full MCP specification compliance.

## 🎯 Key Features

- ✅ **100% MCP Compliant** - Full compliance with MCP specification v2024-11-05
- 🛑 **Abort Functionality** - Cancel long-running operations gracefully
- 🧠 **Smart Crawling** - Intelligent content extraction with relevance scoring
- 🔗 **Link Analysis** - Advanced link extraction and categorization
- 🗺️ **Sitemap Generation** - Comprehensive website structure mapping
- 🔍 **Content Search** - In-page content search and analysis
- 🌐 **Web Search** - Integrated web search capabilities
- ⏰ **Utility Tools** - Date/time and general utility functions

## 🚀 Quick Start

### Using Docker (Recommended)

```bash
docker-compose up --build
```

### Local Development

```bash
# Install dependencies
npm install
cd mcp-service && npm install

# Start development server
npm run dev
```

## 🛠️ Available Tools

| Tool | Description | Key Features |
|------|-------------|--------------|
| `crawl` | Basic web crawling | Page content, metadata, links |
| `smartCrawl` | Intelligent crawling | Relevance scoring, smart navigation |
| `extractLinks` | Link extraction | Categorization, filtering, analysis |
| `searchInPage` | Content search | Text matching, context extraction |
| `generateSitemap` | Sitemap creation | Site structure, hierarchy mapping |
| `webSearch` | Web search | Multi-engine support, result filtering |
| `getDateTime` | Date/time utilities | Timezone support, formatting |

## 🛑 Abort Functionality

All tools support graceful cancellation:

```bash
# Start a tool execution
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"crawl","arguments":{"url":"https://example.com"}},"id":"tool-123"}'

# Abort the execution
curl -X POST http://localhost:3000/tools/abort/tool-123
```

## 📋 MCP Protocol Support

- ✅ **Session Management**: Complete session lifecycle with `Mcp-Session-Id` headers
- ✅ **Protocol Initialization**: Proper `initialize` and `notifications/initialized` handshake (sketched below)
- ✅ **Capability Negotiation**: Full support for protocol version negotiation
- ✅ **Transport Protocols**: Modern Streamable HTTP and SSE support
- ✅ **Standard Methods**: `tools/list`, `tools/call`, `resources/list`, `resources/read`
- ✅ **Error Handling**: All JSON-RPC 2.0 error codes properly implemented
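
To make the handshake concrete, here is a minimal TypeScript client sketch. The method names, protocol version, and `Mcp-Session-Id` header come from the MCP specification referenced above; the base URL, client metadata, and response handling are assumptions of this sketch.

```typescript
// Minimal sketch of the MCP initialization handshake over Streamable HTTP.
// Assumes the server from this README is listening on localhost:3000.
const BASE_URL = "http://localhost:3000/mcp";

async function initializeSession(): Promise<string | null> {
  // Step 1: `initialize` request announcing our protocol version and identity.
  const initResponse = await fetch(BASE_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      method: "initialize",
      params: {
        protocolVersion: "2024-11-05",
        capabilities: {},
        clientInfo: { name: "example-client", version: "0.1.0" },
      },
      id: 1,
    }),
  });

  // The server issues a session id via the Mcp-Session-Id response header.
  const sessionId = initResponse.headers.get("Mcp-Session-Id");

  // Step 2: acknowledge with `notifications/initialized` (a notification
  // carries no `id`, so no response is expected).
  await fetch(BASE_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(sessionId ? { "Mcp-Session-Id": sessionId } : {}),
    },
    body: JSON.stringify({ jsonrpc: "2.0", method: "notifications/initialized" }),
  });

  return sessionId;
}

initializeSession().then((id) => console.log("Session established:", id));
```

Subsequent requests such as `tools/call` would carry the same `Mcp-Session-Id` header for the lifetime of the session.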

๐Ÿ—๏ธ Architecture

```
mcp-service/
├── src/
│   ├── services/tools/         # Self-contained tool implementations
│   │   ├── BaseTool.ts         # Base class with abort functionality
│   │   ├── CrawlTool.ts        # Web crawling
│   │   ├── SmartCrawlTool.ts   # Intelligent crawling
│   │   ├── ExtractLinksTool.ts # Link extraction
│   │   ├── SearchInPageTool.ts # Page search
│   │   ├── SitemapTool.ts      # Sitemap generation
│   │   ├── WebSearchTool.ts    # Web search
│   │   └── DateTimeTool.ts     # Date/time utilities
│   ├── mcp/                    # MCP protocol implementation
│   ├── routes/                 # HTTP endpoints
│   ├── controllers/            # Request handlers
│   └── config/                 # Configuration management
```

## 🔧 Configuration

### Environment Variables

```bash
# Copy template and configure
cp .env-template .env

# Key settings
PORT=3000                       # Server port
MCP_TRANSPORT=streamable-http   # Transport protocol
PUPPETEER_HEADLESS=true         # Browser mode
LOG_LEVEL=info                  # Logging level
```

### Development vs Production

- **Development**: Relaxed rate limiting (1000 requests/minute)
- **Production**: Secure defaults with stricter limits (see the sketch below)
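
One plausible implementation of this split is `express-rate-limit` driven by the `RATE_LIMIT_*` variables documented later in this README. The sketch below mirrors the documented defaults; it is not the project's actual middleware.

```typescript
// Sketch: environment-driven rate limiting (illustrative, not project code).
import rateLimit from "express-rate-limit";

const isProduction = process.env.NODE_ENV === "production";

export const apiLimiter = rateLimit({
  // Production default: 15-minute window; development: 1 minute.
  windowMs: Number(process.env.RATE_LIMIT_WINDOW ?? (isProduction ? 900_000 : 60_000)),
  // Production default: 100 requests per window; development: 1000 per minute.
  max: Number(process.env.RATE_LIMIT_MAX_REQUESTS ?? (isProduction ? 100 : 1000)),
});

// Typical wiring: app.use("/api", apiLimiter);
```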

## 🧪 Testing

```bash
# Run MCP compliance tests
npm run test:mcp-compliance

# Run all tests
npm test

# Build project
npm run build
```

### Running Locally

1. Install dependencies:

   ```bash
   npm install
   ```

2. Navigate to the mcp-service directory and install its dependencies:

   ```bash
   cd mcp-service
   npm install
   ```

3. Define environment variables (see Configuration).

4. Build and start the server:

   ```bash
   npm run build
   npm start

   # Or for development with auto-reload:
   npm run dev
   ```

## 🌐 API Endpoints
    
| Endpoint | Protocol | Description |
|----------|----------|-------------|
| `POST /mcp` | MCP Streamable HTTP | **Recommended** - Modern JSON-RPC over HTTP |
| `GET /mcp` | HTTP | Server info and connection details |
| `POST /tools/abort/:toolId` | REST | Abort tool execution |
| `GET /api/health` | REST | Health check |
| `GET /api/version` | REST | Version information |

๐Ÿ“ Usage Examples

### Basic Tool Execution

```bash
# Execute crawl tool
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "crawl",
      "arguments": {
        "url": "https://example.com",
        "maxPages": 3
      }
    },
    "id": 1
  }'
```

### Smart Crawl with Query

```bash
# Execute smart crawl with relevance scoring
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "smartCrawl",
      "arguments": {
        "url": "https://example.com",
        "query": "pricing information",
        "maxPages": 5,
        "relevanceThreshold": 3
      }
    },
    "id": 2
  }'
```

### Tool Execution with Abort

```bash
# Start tool execution
TOOL_ID=$(curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"generateSitemap","arguments":{"url":"https://large-site.com","maxPages":100}},"id":"sitemap-1"}' \
  | jq -r '.result.toolId')

# Abort if needed
curl -X POST http://localhost:3000/tools/abort/$TOOL_ID
```
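
The same flow can be driven from TypeScript. The `result.toolId` path mirrors the `jq` expression above; that the server returns the tool id before the crawl finishes is an assumption of this sketch.

```typescript
// Sketch: start a long-running tool call, then abort it (mirrors the curl flow).
async function runAndAbort(): Promise<void> {
  const started = await fetch("http://localhost:3000/mcp", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      method: "tools/call",
      params: {
        name: "generateSitemap",
        arguments: { url: "https://large-site.com", maxPages: 100 },
      },
      id: "sitemap-1",
    }),
  });

  // Response shape taken from the jq expression above (result.toolId).
  const { result } = await started.json();

  // Cancel via the REST abort endpoint if the crawl is no longer needed.
  await fetch(`http://localhost:3000/tools/abort/${result.toolId}`, { method: "POST" });
}

runAndAbort().catch(console.error);
```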

## 📚 Documentation

Additional documents cover:

- Complete API reference and examples
- Tool abort functionality guide
- Architecture and design overview
- Developer documentation
- Contribution guidelines

## 🤝 Contributing

We welcome contributions! Please see the contribution guidelines.

## 📄 License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This means you are free to:

- **Share** - copy and redistribute the material in any medium or format
- **Adapt** - remix, transform, and build upon the material

Under the following terms:

- **Attribution** - You must give appropriate credit
- **NonCommercial** - You may not use the material for commercial purposes

See the license file for the complete license text.


### Tool Discovery

The server exposes metadata for each tool, including its direct REST endpoint:

```json
[
  {
    "name": "smartCrawl",
    "description": "Crawl a website and return markdown-formatted content, potentially answering a specific query.",
    "parameterDescription": "URL to crawl, optional crawling parameters, and an optional query.",
    "returnDescription": "Object containing success status, original URL, markdown content, and optional error message.",
    "endpoint": "/api/tools/smartCrawl"
  },
  {
    "name": "extractLinks",
    "description": "Extract and categorize links from a web page.",
    "parameterDescription": "URL to extract links from.",
    "returnDescription": "Object containing internal and external links with their descriptions.",
    "endpoint": "/api/tools/extractLinks"
  },
  {
    "name": "sitemapGenerator",
    "description": "Generate a sitemap from a website by crawling its pages.",
    "parameterDescription": "URL to crawl for sitemap generation with optional depth and maxPages.",
    "returnDescription": "Object containing the generated sitemap structure.",
    "endpoint": "/api/tools/sitemapGenerator"
  },
  {
    "name": "searchInPage",
    "description": "Search for specific content within a web page.",
    "parameterDescription": "URL and search term to look for.",
    "returnDescription": "Object containing search results and matches.",
    "endpoint": "/api/tools/searchInPage"
  },
  {
    "name": "webSearch",
    "description": "Perform web search and extract results.",
    "parameterDescription": "Search query and maximum number of results.",
    "returnDescription": "Object containing search results with titles, descriptions, and URLs.",
    "endpoint": "/api/tools/webSearch"
  },
  {
    "name": "dateTime",
    "description": "Get current date and time information in various formats.",
    "parameterDescription": "Optional timezone parameter.",
    "returnDescription": "Object containing current date/time in multiple formats.",
    "endpoint": "/api/tools/dateTime"
  }
]
```


### Direct Tool Execution

#### Crawl Tool
```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "maxPages": 3,
    "depth": 1,
    "strategy": "bfs",
    "captureScreenshots": true,
    "captureNetworkTraffic": false,
    "waitTime": 2000
  }'
```

#### Smart Crawl Tool (Markdown Output)

```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/smartCrawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "What is this site about?",
    "maxPages": 2,
    "depth": 1
  }'
```

#### Extract Links Tool

```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/extractLinks \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

#### Sitemap Generator Tool

```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/sitemapGenerator \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "maxPages": 10,
    "depth": 2
  }'
```

#### Search In Page Tool

```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/searchInPage \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "searchTerm": "contact information"
  }'
```

#### Web Search Tool

```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/webSearch \
  -H "Content-Type: application/json" \
  -d '{
    "query": "MCP protocol specification",
    "maxResults": 5
  }'
```

#### Date Time Tool

```bash
curl -X POST http://localhost:${PORT:-3000}/api/tools/dateTime \
  -H "Content-Type: application/json" \
  -d '{
    "timezone": "UTC"
  }'
```

## Environment Variables

- `PORT` (default: 3000): Port for the MCP server.
- `NODE_ENV` (default: development): Environment mode (development or production).
- `MCP_NAME`, `MCP_VERSION`, `MCP_DESCRIPTION`: MCP server identification.
- `CRAWL_DEFAULT_MAX_PAGES` (default: 10): Default maximum pages to crawl.
- `CRAWL_DEFAULT_DEPTH` (default: 3): Default crawl depth.
- `CRAWL_DEFAULT_STRATEGY` (default: bfs): Default crawl strategy (bfs|dfs|bestFirst).
- `CRAWL_DEFAULT_WAIT_TIME` (default: 1000): Default wait time in ms between requests.
- `LOG_LEVEL` (default: info): Logging level (debug|info|warn|error).
- `CACHE_TTL` (default: 3600): Cache TTL in seconds.
- `MAX_REQUEST_SIZE` (default: 10mb): Maximum HTTP payload size.
- `CORS_ORIGINS` (default: *): Allowed origins for CORS.
- `RATE_LIMIT_WINDOW` (default: 900000): Rate limit window in milliseconds (15 minutes).
- `RATE_LIMIT_MAX_REQUESTS` (default: 100): Max requests per rate limit window.
- `PUPPETEER_EXECUTABLE_PATH` (optional): Custom path to Chrome/Chromium executable for Puppeteer.
- `PUPPETEER_SKIP_DOWNLOAD` (default: false): Skip automatic Chromium download during installation.

## Configuration

All configuration is centralized in the `src/config` directory, with separate modules for different aspects of the system (a sketch of the pattern follows the list):

- `appConfig.ts`: Core application settings
- `mcpConfig.ts`: MCP server specific settings
- `securityConfig.ts`: Security-related settings
- `crawlConfig.ts`: Web crawling default parameters
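
As an illustration of the pattern, `crawlConfig.ts` might centralize the crawl defaults listed under Environment Variables. The sketch below mirrors those documented defaults; it is not the file's actual contents.

```typescript
// Sketch of a src/config/crawlConfig.ts-style module (illustrative).
export type CrawlStrategy = "bfs" | "dfs" | "bestFirst";

export interface CrawlConfig {
  maxPages: number;    // CRAWL_DEFAULT_MAX_PAGES
  depth: number;       // CRAWL_DEFAULT_DEPTH
  strategy: CrawlStrategy;
  waitTimeMs: number;  // CRAWL_DEFAULT_WAIT_TIME
}

export const crawlConfig: CrawlConfig = {
  maxPages: Number(process.env.CRAWL_DEFAULT_MAX_PAGES ?? 10),
  depth: Number(process.env.CRAWL_DEFAULT_DEPTH ?? 3),
  strategy: (process.env.CRAWL_DEFAULT_STRATEGY ?? "bfs") as CrawlStrategy,
  waitTimeMs: Number(process.env.CRAWL_DEFAULT_WAIT_TIME ?? 1000),
};
```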

## Key Components

- **Server**: Unified server implementation integrating Express and MCP capabilities.
- **Routes**: Organized in separate files for API and MCP endpoints.
- **SimpleMcpServer**: Implements MCP discovery and tool invocation logic.
- **Controllers**: `toolController` and `resourceController` for handling business logic.
- **Tools**: Self-contained tool implementations extending the `BaseTool` abstract class:
  - `CrawlTool`: Basic web crawling with content extraction
  - `SmartCrawlTool`: Intelligent crawling with markdown output and query support
  - `ExtractLinksTool`: Extract and categorize links from web pages
  - `SitemapTool`: Generate sitemaps from crawled content
  - `SearchInPageTool`: Search for specific content within web pages
  - `WebSearchTool`: Web search functionality with result extraction
  - `DateTimeTool`: Date and time utility functions
- **Configuration**: Centralized configuration system with module-specific settings.
- **MCP Transport**: Supports both modern Streamable HTTP and legacy SSE transport methods.
- **Tool Architecture**: Each tool includes its own browser management, crawling logic, and error handling for complete self-containment.

## Web Crawling Features

The MCP server includes a comprehensive set of self-contained crawling tools, each with integrated browser management and specialized functionality.

### Tool-Based Architecture

- **Self-Contained Design**: Each tool manages its own Puppeteer browser instance and crawling logic
- **BaseTool Abstract Class**: Common functionality shared across all tools with standardized error handling (sketched below)
- **Independent Operation**: Tools operate without shared service dependencies for improved reliability
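
A minimal sketch of what this pattern can look like, assuming one `AbortController` per execution; only the `BaseTool` name comes from the repository layout, the rest is illustrative:

```typescript
// Illustrative BaseTool sketch: shared abort + error handling for all tools.
export abstract class BaseTool<TArgs, TResult> {
  private controller = new AbortController();

  /** Signal threaded into fetch/Puppeteer calls so work can be cancelled. */
  protected get signal(): AbortSignal {
    return this.controller.signal;
  }

  /** Invoked by POST /tools/abort/:toolId to cancel a running execution. */
  abort(): void {
    this.controller.abort();
  }

  /** Standardized entry point with shared error handling. */
  async execute(args: TArgs): Promise<{ success: boolean; result?: TResult; error?: string }> {
    try {
      return { success: true, result: await this.run(args) };
    } catch (err) {
      if (this.signal.aborted) return { success: false, error: "aborted" };
      return { success: false, error: err instanceof Error ? err.message : String(err) };
    }
  }

  /** Tool-specific logic implemented by each subclass. */
  protected abstract run(args: TArgs): Promise<TResult>;
}
```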

### Available Tools

#### CrawlTool

- **Basic Web Crawling**: Extract text content and tables from web pages
- **Multiple Strategies**: BFS, DFS, and best-first crawling approaches
- **Screenshot Capture**: Optional full-page screenshot functionality
- **Content Extraction**: Intelligent visible text extraction skipping hidden elements

#### SmartCrawlTool

- **Markdown Output**: HTML to Markdown conversion with custom formatting rules (see the sketch below)
- **Query-Based Crawling**: Focused content extraction based on specific queries
- **Smart Content Detection**: Enhanced relevance scoring for different content types
- **Structured Output**: Well-formatted markdown with proper headings and sections
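
The README does not name the HTML-to-Markdown converter. As one plausible shape for the "custom formatting rules", here is a sketch using the `turndown` library; the specific rule is illustrative:

```typescript
// Sketch: HTML -> Markdown with a custom rule (converter choice is assumed).
import TurndownService from "turndown";

const turndown = new TurndownService({
  headingStyle: "atx",      // emit "# Heading" style headings
  codeBlockStyle: "fenced", // emit fenced code blocks
});

// Illustrative custom rule: render table captions as an emphasized line.
turndown.addRule("tableCaption", {
  filter: "caption",
  replacement: (content) => `\n*${content}*\n`,
});

export function htmlToMarkdown(html: string): string {
  return turndown.turndown(html);
}
```
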
#### ExtractLinksTool

- **Link Categorization**: Separate internal and external link extraction
- **Link Analysis**: Extract link text, URLs, and descriptions
- **Comprehensive Coverage**: Find all clickable links within page content

#### SitemapTool

- **Sitemap Generation**: Create structured sitemaps from crawled content
- **Hierarchical Structure**: Organize pages by depth and relationship
- **Configurable Depth**: Control crawling depth for sitemap generation

#### SearchInPageTool

- **Content Search**: Find specific terms or phrases within web pages
- **Context Extraction**: Provide surrounding context for search matches
- **Relevance Scoring**: Rank search results by relevance

#### WebSearchTool

- **Web Search Integration**: Perform web searches and extract results
- **Result Processing**: Extract titles, descriptions, and URLs from search results
- **Configurable Results**: Control number of results returned

#### DateTimeTool

- **Time Utilities**: Current date/time in multiple formats
- **Timezone Support**: Convert times across different timezones
- **Formatting Options**: Various date/time format outputs

### Shared Browser Management Features

- **Robust Error Handling**: Improved browser initialization with proper error handling and fallback mechanisms
- **Custom Chrome Path**: Support for custom Chrome/Chromium executable paths via `PUPPETEER_EXECUTABLE_PATH`
- **Resource Management**: Automatic browser cleanup and connection management in each tool
- **Independent Instances**: Each tool manages its own browser instance for better isolation

### Content Extraction (Across Tools)

- **Text Extraction**: Intelligent visible text extraction that skips hidden elements and scripts (see the sketch below)
- **Markdown Conversion**: HTML to Markdown conversion with custom rules for tables and code blocks
- **Table Extraction**: Structured extraction of HTML tables with caption support
- **Screenshot Capture**: Optional full-page screenshot functionality
- **Link Processing**: Extract and categorize internal vs external links
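
A sketch of how visible-text extraction can work inside a Puppeteer page, assuming a DOM walk that filters on computed styles; the function name and details are illustrative:

```typescript
// Sketch: collect visible text, skipping scripts, styles and hidden elements.
import type { Page } from "puppeteer";

export async function extractVisibleText(page: Page): Promise<string> {
  return page.evaluate(() => {
    const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
    const chunks: string[] = [];
    for (let node = walker.nextNode(); node; node = walker.nextNode()) {
      const el = node.parentElement;
      if (!el) continue;
      // Skip script/style content entirely.
      if (el.closest("script, style, noscript")) continue;
      // Skip elements hidden via CSS.
      const style = window.getComputedStyle(el);
      if (style.display === "none" || style.visibility === "hidden") continue;
      const text = node.textContent?.trim();
      if (text) chunks.push(text);
    }
    return chunks.join(" ");
  });
}
```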

### Crawling Strategies

- **Breadth-First Search (BFS)**: Default strategy for systematic exploration
- **Depth-First Search (DFS)**: Prioritizes deeper paths in the site structure
- **Best-First Search**: Simple heuristic-based crawling (shorter paths first); see the frontier sketch below
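
All three strategies can share a single frontier, differing only in which entry is taken next. The sketch below is illustrative; the `CrawlItem` shape is an assumption, and the best-first heuristic follows the "shorter paths first" description above:

```typescript
// Sketch: one frontier, three selection policies (illustrative).
type Strategy = "bfs" | "dfs" | "bestFirst";

interface CrawlItem {
  url: string;
  depth: number;
}

function nextIndex(frontier: CrawlItem[], strategy: Strategy): number {
  if (strategy === "bfs") return 0;                   // FIFO: oldest page first
  if (strategy === "dfs") return frontier.length - 1; // LIFO: deepest path first

  // bestFirst: prefer the URL with the shortest path, per the heuristic above.
  const pathLength = (u: string) => new URL(u).pathname.length;
  let best = 0;
  for (let i = 1; i < frontier.length; i++) {
    if (pathLength(frontier[i].url) < pathLength(frontier[best].url)) best = i;
  }
  return best;
}

// Usage: const item = frontier.splice(nextIndex(frontier, strategy), 1)[0];
```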

### Advanced Options

- **Network Traffic Monitoring**: Optional tracking of HTTP requests during crawling
- **Multi-page Crawling**: Support for crawling multiple pages with depth control
- **Wait Time Configuration**: Configurable delays between page loads
- **Screenshot Capture**: Optional full-page screenshots saved to a temporary directory

### Error Resilience

- **Browser Launch Fallbacks**: Multiple strategies for browser initialization (sketched below)
- **Connection Recovery**: Automatic reconnection handling for disconnected browsers
- **URL Validation**: Robust URL parsing and validation with error handling
- **Timeout Management**: Configurable timeouts to prevent hanging on slow pages
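
A sketch of launch fallbacks in the spirit of these bullets, assuming Puppeteer; the fallback order and flags are illustrative rather than the project's actual sequence:

```typescript
// Sketch: try several Puppeteer launch configurations before giving up.
import puppeteer, { Browser } from "puppeteer";

export async function launchBrowser(): Promise<Browser> {
  const attempts = [
    // 1) Custom executable if PUPPETEER_EXECUTABLE_PATH is set.
    { executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || undefined },
    // 2) Bundled Chromium with sandbox flags relaxed (common in containers).
    { args: ["--no-sandbox", "--disable-setuid-sandbox"] },
  ];

  let lastError: unknown;
  for (const options of attempts) {
    try {
      return await puppeteer.launch({ headless: true, ...options });
    } catch (err) {
      lastError = err; // fall through to the next launch strategy
    }
  }
  throw lastError;
}
```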

## Customization

- Add new routes in the `routes` directory.
- Extend MCP capabilities by modifying `SimpleMcpServer` or adding new controllers.
- Create new tools by extending the `BaseTool` abstract class in the `services/tools` directory.
- Each tool is self-contained and can be developed independently.
- Tune performance and security via environment variables in the `config` directory.

## Architecture Diagram

```mermaid
graph LR
  Client --> Server
  Server --> Router["Routes (API/MCP)"]
  Router --> |SSE| SimpleMcpServer
  Router --> |Streamable HTTP| SimpleMcpServer
  Router --> ApiControllers
  SimpleMcpServer --> Controllers
  Controllers --> Tools["Self-Contained Tools"]
  Tools --> |BaseTool| CrawlTool
  Tools --> |BaseTool| SmartCrawlTool
  Tools --> |BaseTool| ExtractLinksTool
  Tools --> |BaseTool| SitemapTool
  Tools --> |BaseTool| SearchInPageTool
  Tools --> |BaseTool| WebSearchTool
  Tools --> |BaseTool| DateTimeTool
```
## References

- High-level architecture and conceptual overview.
- Detailed explanations of source files.
- Endpoint specs and JSON-RPC methods.
- Model Context Protocol SDK: Official SDK documentation.
- MCP Transport Models: Details on SSE vs Streamable HTTP.

Made with ❤️ for the MCP community