Common Crawl MCP Server
Transform Common Crawl's petabyte-scale web archive into an accessible research platform through AI-powered analysis.
🚀 Features
An epic MCP server that enables:
- Discovery & Metadata - Explore available crawls, search the index, analyze domain statistics
- Data Fetching - Retrieve page content efficiently with smart caching
- Parsing & Analysis - Extract structured data, detect technologies, analyze SEO
- Aggregation & Statistics - Domain-wide reports, link graphs, evolution timelines
- Export & Integration - CSV, JSONL, custom datasets, report generation
- MCP Resources - LLM-accessible data exposure
- MCP Prompts - Guided workflows for complex analysis
📋 Prerequisites
- Python 3.11+
- AWS CLI (optional - for custom S3 configurations)
- Redis (optional - for enhanced caching)
🔧 Installation
Quick Start
```bash
# Clone the repository
git clone https://github.com/yourusername/common-crawl-mcp-server.git
cd common-crawl-mcp-server

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e .
```
Development Setup
```bash
# Install with development dependencies
uv sync --extra dev

# Or with pip
pip install -e ".[dev]"
```
💻 Usage
Starting the Server
```bash
# Run with stdio transport (for MCP clients)
python -m src.server

# Or use the installed command
common-crawl-mcp
```
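To use the server from an MCP client (for example, Claude Desktop), register the stdio command in the client's server configuration. A minimal sketch, assuming the `common-crawl-mcp` entry point from the install step is on your `PATH` (the `common-crawl` key is just a label):

```json
{
  "mcpServers": {
    "common-crawl": {
      "command": "common-crawl-mcp"
    }
  }
}
```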
Configuration
Create a .env file in the project root:
```bash
# Cache Configuration
CACHE_DIR=./cache
CACHE_MAX_SIZE_GB=50

# S3 Configuration (uses anonymous access by default)
AWS_REGION=us-east-1

# Redis Configuration (optional)
REDIS_URL=redis://localhost:6379

# Rate Limiting
MAX_CONCURRENT_REQUESTS=5
REQUESTS_PER_SECOND=10
```
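As a rough sketch of how these variables could map onto a settings model (the real definitions live in `src/config.py`; the field names below are assumptions):

```python
# Illustrative sketch only -- the actual settings model is defined in src/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings matches env vars case-insensitively, so CACHE_DIR -> cache_dir
    model_config = SettingsConfigDict(env_file=".env")

    cache_dir: str = "./cache"
    cache_max_size_gb: int = 50
    aws_region: str = "us-east-1"
    redis_url: str | None = None
    max_concurrent_requests: int = 5
    requests_per_second: int = 10

settings = Settings()  # reads .env if present, otherwise falls back to the defaults above
```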
Example Usage with MCP
```python
# Using the MCP Python client over stdio
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="python", args=["-m", "src.server"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available crawls
            crawls = await session.call_tool("list_crawls")

            # Search the index for a domain
            results = await session.call_tool("search_index", {
                "query": "example.com",
                "crawl_id": "CC-MAIN-2024-10",
            })

            # Fetch page content
            page = await session.call_tool("fetch_page_content", {
                "url": "https://example.com",
                "crawl_id": "CC-MAIN-2024-10",
            })

asyncio.run(main())
```
🏗️ Architecture
```
src/
├── server.py            # Main FastMCP server
├── config.py            # Configuration management
├── core/                # Core infrastructure
│   ├── cc_client.py     # CDX Server API client
│   ├── cache.py         # Multi-tier caching
│   ├── s3_manager.py    # S3 access wrapper
│   └── warc_parser.py   # WARC file parsing
├── tools/               # MCP tools
│   ├── discovery.py     # Discovery & metadata
│   ├── fetching.py      # Data fetching
│   ├── parsing.py       # Parsing & analysis
│   ├── aggregation.py   # Aggregation & statistics
│   ├── export.py        # Export & integration
│   └── advanced.py      # Advanced features
├── resources/           # MCP resources
├── prompts/             # MCP prompts
├── models/              # Pydantic models
└── utils/               # Utilities
```
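The tools under `src/tools/` are exposed through the FastMCP server in `src/server.py`. A minimal sketch of that registration pattern, assuming the MCP Python SDK's `FastMCP` (names and body here are illustrative, not the actual implementation):

```python
# Hypothetical sketch of tool registration; the real code lives in src/server.py and src/tools/
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("common-crawl")

@mcp.tool()
async def list_crawls() -> list[dict]:
    """Return metadata for every available Common Crawl crawl."""
    ...  # e.g. fetch and cache the public crawl index via the CDX client

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```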
🔍 Available Tools
Discovery & Metadata
- `list_crawls()` - Get all available Common Crawl crawls
- `get_crawl_stats(crawl_id)` - Statistics for specific crawl
- `search_index(query, crawl_id)` - Search the CDX index
- `get_domain_stats(domain, crawl_id)` - Domain statistics
- `compare_crawls(crawl_ids, domain)` - Track changes over time
Data Fetching
- `fetch_page_content(url, crawl_id)` - Get page HTML and metadata
- `fetch_warc_records(urls, crawl_id)` - Batch fetch WARC records
- `batch_fetch_pages(domain, crawl_id)` - Get all pages from domain
Parsing & Analysis
- `parse_html(content)` - Extract structured data from HTML
- `analyze_technologies(url, crawl_id)` - Detect tech stack
- `extract_links(url, crawl_id)` - Link analysis
- `analyze_seo(url, crawl_id)` - SEO audit
- `detect_language(url, crawl_id)` - Language detection
Aggregation & Statistics
- `domain_technology_report(domain, crawl_id)` - Complete tech audit
- `domain_link_graph(domain, crawl_id)` - Internal link structure
- `keyword_frequency_analysis(urls, keywords)` - Keyword analysis
- `header_analysis(domain, crawl_id)` - HTTP header analysis
Export & Integration
- `export_to_csv(data, fields, filepath)` - CSV export
- `export_to_jsonl(data, filepath)` - JSONL export
- `create_dataset(query, name)` - Save reusable datasets
- `generate_report(analysis_results, format)` - Generate reports
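As a sketch of the export workflow, continuing inside the `ClientSession` from the usage example above (argument shapes are illustrative; consult the schemas the server advertises via `list_tools()`):

```python
# Continuing inside the ClientSession from the usage example above.
# Argument shapes are illustrative -- check the schemas returned by list_tools().
results = await session.call_tool("search_index", {
    "query": "example.com",
    "crawl_id": "CC-MAIN-2024-10",
})
await session.call_tool("export_to_csv", {
    "data": [c.model_dump() for c in results.content],  # tool output as plain dicts
    "fields": ["url", "timestamp", "status"],            # hypothetical field names
    "filepath": "./exports/example_com.csv",
})
```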
📚 MCP Resources
Access Common Crawl data through MCP resources:
- `commoncrawl://crawls` - List of all crawls
- `commoncrawl://crawl/{crawl_id}/stats` - Crawl statistics
- `commoncrawl://domain/{domain}/latest` - Latest domain data
- `commoncrawl://domain/{domain}/timeline` - Historical presence
- `commoncrawl://page/{url}` - Page content from latest crawl
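Resources are read with the MCP resource API rather than tool calls. A small sketch, again inside the `ClientSession` from the usage example above:

```python
from pydantic import AnyUrl

# Inside the ClientSession from the usage example above
crawls = await session.read_resource(AnyUrl("commoncrawl://crawls"))
latest = await session.read_resource(AnyUrl("commoncrawl://domain/example.com/latest"))
for item in latest.contents:
    print(item)  # TextResourceContents / BlobResourceContents entries
```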
🎯 MCP Prompts
Guided workflows for complex analysis:
- `investigate_domain(domain)` - Comprehensive domain investigation
- `competitive_analysis(domains)` - Multi-domain comparison
- `technology_audit(domain)` - Technology stack deep dive
- `seo_audit(url)` - SEO analysis workflow
- `historical_investigation(domain, start_date, end_date)` - Temporal analysis
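Prompts are fetched through the standard MCP prompt API and return ready-to-use messages. A minimal sketch inside the same `ClientSession`:

```python
# Inside the ClientSession from the usage example above
prompt = await session.get_prompt("investigate_domain", {"domain": "example.com"})
for message in prompt.messages:
    print(message.role, message.content)  # forward these to your LLM of choice
```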
🧪 Testing
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov

# Run specific test file
pytest tests/test_core/test_cc_client.py

# Run integration tests
pytest tests/integration/
```
📊 Use Cases
SEO Professional
"Show me all pages from competitor.com in the last 6 crawls, analyze their title tag patterns, and identify their most linked-to content"
Security Researcher
"Find all WordPress sites from the latest crawl using outdated jQuery versions, export the domains to CSV"
Data Scientist
"Get 10,000 news articles from .com domains, extract structured data, export to JSONL for training"
Business Analyst
"Compare mycompany.com vs competitor1.com vs competitor2.com - technology usage, page count trends, link authority over the past 2 years"
🚦 Performance
- Complete technology report for a domain in <2 minutes
- Cache reduces S3 costs by >80%
- Test coverage >80%
- API response times <500ms (cached), <5s (uncached)
🤝 Contributing
Contributions are welcome! Please read the contributing guidelines before opening a pull request.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'feat: add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📄 License
This project is licensed under the MIT License; see the license file for details.
🙏 Acknowledgments
- Built with FastMCP
- Uses Common Crawl public dataset
- WARC parsing by warcio
- HTML parsing by Beautiful Soup
📚 Resources
Status: 🚧 In Development | Started: 2025-10-25 | Target Completion: 2025-11-15