docs-scraper-mcp
A production-quality MCP server for scraping documentation with Firecrawl and storing in GitHub.
Features
Scraping
- scrape_docs - Scrape any documentation URL with URL pattern filtering
- scrape_spa - Scrape JavaScript-heavy sites (React, Vue, Angular, Stoplight) with wait times and actions
Retrieval
- list_docs - List all scraped documentation sources with stats
- get_doc - Retrieve content from scraped documentation
- search_docs - Full-text search across all scraped docs with ranked results
- docs_stats - Get statistics about your documentation library
Management
- delete_docs - Delete docs for a specific domain
- delete_all_docs - Delete all scraped documentation
Additional Features
- Tagging - Categorize docs with tags for easy filtering
- Version tracking - Track scrape versions for each domain
- Word count - Automatic word counting for all scraped pages
- URL filtering - Include/exclude patterns for targeted scraping
- Retry logic - Automatic retries with exponential backoff (see the sketch after this list)
- GitHub backup - All docs backed up to GitHub automatically
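The retry behavior lives in src/services/firecrawl.ts; the helper below is only a minimal sketch of the backoff idea, with an illustrative name and parameters rather than the project's actual implementation.

// Illustrative only: see src/services/firecrawl.ts for the real logic.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}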
Setup
1. Clone and build
git clone https://github.com/adwonnacott/docs-scraper-mcp.git ~/docs-scraper-mcp
cd ~/docs-scraper-mcp
npm install
npm run build
2. Get your tokens
- Firecrawl API key: Sign up at https://firecrawl.dev
- GitHub token: Run gh auth login, then gh auth token, or create a PAT at https://github.com/settings/tokens
3. Create a GitHub repo for storing docs
gh repo create scraped-docs --public
# IMPORTANT: Add a README to initialize the repo
echo "# Scraped Documentation" > /tmp/README.md
gh repo clone yourusername/scraped-docs /tmp/scraped-docs
cp /tmp/README.md /tmp/scraped-docs/
cd /tmp/scraped-docs && git add . && git commit -m "Initial commit" && git push -u origin main
4. Configure Claude Code
Add to ~/.claude.json (if the server fails to launch, use the absolute path to dist/index.js in args, since ~ may not be expanded when the command runs outside a shell):
{
"mcpServers": {
"docs-scraper": {
"command": "node",
"args": ["~/docs-scraper-mcp/dist/index.js"],
"env": {
"FIRECRAWL_API_KEY": "your-firecrawl-key",
"GITHUB_TOKEN": "your-github-token",
"GITHUB_REPO": "yourusername/scraped-docs"
}
}
}
}
5. Restart Claude Code
The tools will be available after restart.
Usage Examples
Basic scraping
"Scrape the docs at https://docs.stripe.com/api"
Scraping with filters
"Scrape https://docs.example.com but only include /api/* paths and exclude /blog/*"
Scraping SPAs
"Use scrape_spa on https://developer.timecamp.com/ with 10 second wait time"
Searching
"Search my docs for 'authentication'"
"Search for 'webhook' in the stripe docs only"
Tagging
"Scrape https://docs.example.com with tags ['api', 'payments']"
Getting stats
"Show me my docs stats"
Local Storage
Scraped docs are saved to ~/scraped-docs/, organized by domain:
~/scraped-docs/
├── index.json # Master index
├── docs-stripe-com/
│ ├── _metadata.json # Domain metadata
│ ├── index.md
│ ├── api_authentication.md
│ └── ...
└── developer-timecamp-com/
├── _metadata.json
└── ...
Architecture
src/
├── index.ts # MCP server entry point, tool registration
├── types.ts # Shared TypeScript types
├── services/
│ ├── firecrawl.ts # Firecrawl API client with retry logic
│ └── github.ts # GitHub API client for backup
└── tools/
├── scrape.ts # Scraping logic
├── list.ts # List docs
├── get.ts # Get doc content
├── search.ts # Full-text search
└── delete.ts # Delete docs
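For orientation, index.ts presumably follows the standard MCP TypeScript SDK pattern for registering tools over stdio. A minimal sketch of that pattern (not the project's actual code):

// Minimal MCP SDK sketch; the real registration lives in src/index.ts.
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "docs-scraper", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

// Advertise the tools and their input schemas.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "scrape_docs",
      description: "Scrape a documentation URL",
      inputSchema: {
        type: "object",
        properties: { url: { type: "string" } },
        required: ["url"],
      },
    },
    // ...the remaining tools
  ],
}));

// Dispatch calls to the handlers in src/tools/.
server.setRequestHandler(CallToolRequestSchema, async (request) => ({
  content: [{ type: "text", text: `called ${request.params.name}` }],
}));

await server.connect(new StdioServerTransport());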
Troubleshooting
"Git Repository is empty" error
Initialize your GitHub repo with at least one commit (a README), as described in step 3 of Setup.
Only 2 pages scraped from SPA
Some sites, like Stoplight, load content via JavaScript. Use scrape_spa with a higher waitFor value. Note: if the site is a true single-page app that doesn't expose distinct URLs per page, you may only get one or two pages regardless.
Rate limited
The Firecrawl free tier has limits. Upgrade or wait for the limit to reset.
License
MIT