
Open Crawler MCP Server


A Model Context Protocol (MCP) server that crawls web pages and extracts their content in multiple output formats.

Features

  • Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
  • Smart Content Extraction: CSS selector support for targeted content extraction
  • Robots.txt Compliance: Automatic robots.txt checking and compliance
  • Rate Limiting: Built-in rate limiting (1 second minimum between requests)
  • Size Protection: 10MB maximum page size to prevent memory issues
  • Structured Content: Extract headings, paragraphs, links, images, and lists separately
  • Error Handling: Comprehensive error codes for different failure scenarios

MCP Client Configuration

Add this server to your MCP client configuration:

{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
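
The configuration above launches the server with npx over stdio. For programmatic use, a minimal sketch with the official MCP TypeScript SDK (@modelcontextprotocol/sdk) might look like the following; the client name and version are placeholders, not part of this package:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the server over stdio, mirroring the configuration above
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@elchika-inc/open-crawler-mcp-server"],
});

// Client name and version are arbitrary placeholders for this sketch
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Expected to list crawl_page and check_robots
const { tools } = await client.listTools();
console.log(tools.map((tool) => tool.name));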

Available Tools

crawl_page

Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.

Parameters:

  • url (required): Target URL to crawl
  • selector (optional): CSS selector for specific content extraction
  • format (optional): Output format - text, markdown, xml, or json (default: text)
  • text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)

Output Formats:

  • text: Clean, plain text content with whitespace normalized
  • markdown: Well-formatted Markdown with headings, links, images, and lists preserved
  • xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
  • json: Structured JSON object containing categorized content elements (illustrated below)
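
As an illustration of the json format, a result for a simple page might be grouped as in the sketch below. The field names mirror the content categories listed above, but they are assumptions for illustration, not a documented schema:

{
  "headings": [{ "level": 1, "text": "Example Domain" }],
  "paragraphs": ["This domain is for use in illustrative examples in documents."],
  "links": [{ "text": "More information", "href": "https://www.iana.org/domains/example" }],
  "images": [],
  "lists": []
}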

Examples:

Basic text extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}

Markdown extraction with CSS selector:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}

Structured JSON extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
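
When driving the server from code rather than a client configuration, the same argument objects pass through the MCP callTool request. A minimal sketch, assuming the connected client from the configuration section:

const result = await client.callTool({
  name: "crawl_page",
  arguments: {
    url: "https://example.com",
    selector: "article",
    format: "markdown",
  },
});

// Extracted content arrives as MCP content parts, typically { type: "text", text: "..." }
console.log(result.content);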

check_robots

Checks whether a URL is allowed to be crawled according to the site's robots.txt file.

Parameters:

  • url (required): URL to check for crawling permission

Example:

{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
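
A sketch of an explicit pre-flight check, again assuming the connected client from the earlier sketches; the exact shape of the tool's response is not documented here, so this only shows the call:

// Ask the server whether the URL may be crawled before fetching it
const permission = await client.callTool({
  name: "check_robots",
  arguments: { url: "https://example.com/page" },
});
console.log(permission.content);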

Error Handling

The server reports distinct error codes for common failure scenarios:

  • Network connection issues
  • Invalid HTML or missing content
  • Robots.txt restrictions
  • Request timeouts or rate limits
  • Content size too large (>10MB)
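
How these failures surface depends on the client. With the TypeScript SDK used in the earlier sketches, tool-level failures are conventionally reported via isError on the result while transport failures throw; the pattern below is a sketch and does not reproduce the server's specific error codes:

try {
  const result = await client.callTool({
    name: "crawl_page",
    arguments: { url: "https://example.com", format: "text" },
  });
  if (result.isError) {
    // Tool-level failure: robots.txt denial, oversized page, fetch error, etc.
    console.error("crawl failed:", result.content);
  }
} catch (err) {
  // Protocol or transport failure: server exited, request timed out, etc.
  console.error("request failed:", err);
}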

License

MIT