elchika-inc/open-crawler-mcp-server
Open Crawler MCP Server is a versatile tool for web crawling and content extraction, supporting multiple output formats and ensuring compliance with web standards.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for crawling web pages and extracting their content in multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests; see the sketch after this list)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
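The rate limiting and size protection described above can be pictured with a short sketch. The following TypeScript is purely illustrative, built from the numbers stated in the feature list (at least 1 second between requests, 10MB page cap); the names and structure are assumptions, not the server's actual implementation.

// Illustrative sketch only: the kind of rate limiting and size protection
// listed above. Names and structure are hypothetical, not the real code.
const MIN_REQUEST_INTERVAL_MS = 1_000;        // at least 1 second between requests
const MAX_PAGE_SIZE_BYTES = 10 * 1024 * 1024; // 10MB page size limit

let lastRequestAt = 0;

async function politeFetch(url: string): Promise<string> {
  // Rate limiting: wait until at least 1 second has passed since the last request.
  const waitMs = lastRequestAt + MIN_REQUEST_INTERVAL_MS - Date.now();
  if (waitMs > 0) {
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  lastRequestAt = Date.now();

  const response = await fetch(url);

  // Size protection: reject pages larger than the 10MB limit.
  const body = await response.arrayBuffer();
  if (body.byteLength > MAX_PAGE_SIZE_BYTES) {
    throw new Error(`Page exceeds ${MAX_PAGE_SIZE_BYTES} bytes: ${url}`);
  }
  return new TextDecoder().decode(body);
}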
MCP Client Configuration
Add this server to your MCP client configuration:
{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
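Where exactly this block goes depends on your MCP client; with Claude Desktop, for example, it is typically added to the claude_desktop_config.json file.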
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
- url (required): Target URL to crawl
- selector (optional): CSS selector for specific content extraction
- format (optional): Output format - text, markdown, xml, or json (default: text)
- text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)
Output Formats:
- text: Clean, plain text content with whitespace normalized
- markdown: Well-formatted Markdown with headings, links, images, and lists preserved
- xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
- json: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}
Markdown extraction with CSS selector:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}
Structured JSON extraction:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
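The exact shape of the structured output is not documented here; based on the categories listed under Output Formats (headings, paragraphs, links, images, and lists), a json-format result might look roughly like the following. The field names are illustrative assumptions only.

{
  "headings": [{ "level": 1, "text": "Example Domain" }],
  "paragraphs": ["This domain is for use in illustrative examples in documents."],
  "links": [{ "text": "More information...", "href": "https://www.iana.org/domains/example" }],
  "images": [],
  "lists": []
}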
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
- url (required): URL to check for crawling permission
Example:
{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
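For context, this kind of robots.txt check can be sketched in a few lines of TypeScript, for example with the robots-parser package. This is only an illustration of the concept (the user agent string and fallback behavior are assumptions), not necessarily how the server implements check_robots.

// Conceptual sketch of a robots.txt permission check (not the server's actual code).
import robotsParser from "robots-parser";

async function isCrawlAllowed(pageUrl: string, userAgent = "open-crawler"): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).toString();
  const response = await fetch(robotsUrl);

  // A missing robots.txt is conventionally treated as "allow everything".
  if (!response.ok) return true;

  const robots = robotsParser(robotsUrl, await response.text());
  return robots.isAllowed(pageUrl, userAgent) ?? true;
}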
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
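When the tools are called programmatically, these failures surface as tool errors on the MCP result. A minimal TypeScript sketch using the @modelcontextprotocol/sdk client is shown below; the client name, chosen arguments, and logging are illustrative assumptions.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the crawler server over stdio, mirroring the configuration above.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@elchika-inc/open-crawler-mcp-server"],
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

try {
  const result = await client.callTool({
    name: "crawl_page",
    arguments: { url: "https://example.com", format: "markdown" },
  });
  // Tool-level failures (robots.txt restrictions, oversized pages, timeouts, ...)
  // are reported via the isError flag on the result.
  if (result.isError) {
    console.error("crawl_page failed:", result.content);
  } else {
    console.log(result.content);
  }
} catch (err) {
  // Transport or protocol errors (e.g. the server process could not start).
  console.error("MCP request failed:", err);
} finally {
  await client.close();
}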
License
MIT