docs-scraper-mcp

A production-quality MCP server for scraping documentation with Firecrawl and storing the results in GitHub.

Features

Scraping

  • scrape_docs - Scrape any documentation URL with URL pattern filtering
  • scrape_spa - Scrape JavaScript-heavy sites (React, Vue, Angular, Stoplight) with wait times and actions

Retrieval

  • list_docs - List all scraped documentation sources with stats
  • get_doc - Retrieve content from scraped documentation
  • search_docs - Full-text search across all scraped docs with ranked results (see the sketch after this list)
  • docs_stats - Get statistics about your documentation library
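
How results are ranked isn't specified in this README. As a rough, hypothetical illustration only (searchDocs and Page are invented names, not the project's API), ranked full-text search can be as simple as scoring pages by term frequency:

// Hypothetical ranking sketch; not the project's actual implementation.
interface Page {
  path: string;
  content: string;
}

function searchDocs(pages: Page[], query: string): Page[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return pages
    .map((page) => {
      const text = page.content.toLowerCase();
      // Score a page by the total occurrences of every query term.
      const score = terms.reduce(
        (sum, term) => sum + (text.split(term).length - 1),
        0
      );
      return { page, score };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((s) => s.page);
}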

Management

  • delete_docs - Delete docs for a specific domain
  • delete_all_docs - Delete all scraped documentation

Additional Features

  • Tagging - Categorize docs with tags for easy filtering
  • Version tracking - Track scrape versions for each domain
  • Word count - Automatic word counting for all scraped pages
  • URL filtering - Include/exclude patterns for targeted scraping
  • Retry logic - Automatic retries with exponential backoff (see the sketch after this list)
  • GitHub backup - All docs backed up to GitHub automatically
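
The actual retry parameters live in src/services/firecrawl.ts and aren't documented here; this is a minimal sketch of exponential backoff in general, with illustrative attempt counts and delays:

// Minimal exponential-backoff sketch; the attempt count and delays are
// illustrative, not the values used in src/services/firecrawl.ts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // Back off 1s, 2s, 4s, ... between attempts.
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}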

Setup

1. Clone and build

git clone https://github.com/adwonnacott/docs-scraper-mcp.git ~/docs-scraper-mcp
cd ~/docs-scraper-mcp
npm install
npm run build

2. Get your tokens

You will need two credentials for the config in step 4:

  • A Firecrawl API key (FIRECRAWL_API_KEY) from https://firecrawl.dev
  • A GitHub personal access token (GITHUB_TOKEN) that can push to the repo you create in step 3

3. Create a GitHub repo for storing docs

gh repo create scraped-docs --public
# IMPORTANT: the repo needs at least one commit before the server can push to it
gh repo clone yourusername/scraped-docs /tmp/scraped-docs
cd /tmp/scraped-docs
echo "# Scraped Documentation" > README.md
git add . && git commit -m "Initial commit" && git push -u origin HEAD

4. Configure Claude Code

Add to ~/.claude.json (use an absolute path in args; ~ may not be expanded when Claude Code launches the server):

{
  "mcpServers": {
    "docs-scraper": {
      "command": "node",
      "args": ["~/docs-scraper-mcp/dist/index.js"],
      "env": {
        "FIRECRAWL_API_KEY": "your-firecrawl-key",
        "GITHUB_TOKEN": "your-github-token",
        "GITHUB_REPO": "yourusername/scraped-docs"
      }
    }
  }
}

5. Restart Claude Code

The tools will be available after restart.

Usage Examples

Basic scraping

"Scrape the docs at https://docs.stripe.com/api"

Scraping with filters

"Scrape https://docs.example.com but only include /api/* paths and exclude /blog/*"

Scraping SPAs

"Use scrape_spa on https://developer.timecamp.com/ with 10 second wait time"

Searching

"Search my docs for 'authentication'"
"Search for 'webhook' in the stripe docs only"

Tagging

"Scrape https://docs.example.com with tags ['api', 'payments']"

Getting stats

"Show me my docs stats"

Local Storage

Scraped docs are saved to ~/scraped-docs/, organized by domain:

~/scraped-docs/
├── index.json                    # Master index
├── docs-stripe-com/
│   ├── _metadata.json           # Domain metadata
│   ├── index.md
│   ├── api_authentication.md
│   └── ...
└── developer-timecamp-com/
    ├── _metadata.json
    └── ...
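
The schemas of index.json and _metadata.json aren't documented here. Based on the feature list above (tags, version tracking, word counts), the domain metadata plausibly has a shape like the following TypeScript interface; every field name below is an assumption:

// Hypothetical shape for _metadata.json; the fields are guesses inferred
// from the feature list, not the project's actual schema.
interface DomainMetadata {
  domain: string;          // e.g. "docs-stripe-com"
  sourceUrl: string;       // the URL that was scraped
  version: number;         // incremented on each re-scrape
  tags: string[];          // user-supplied tags like ["api", "payments"]
  scrapedAt: string;       // ISO timestamp of the last scrape
  pages: {
    file: string;          // e.g. "api_authentication.md"
    url: string;           // original page URL
    wordCount: number;     // automatic word count
  }[];
}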

Architecture

src/
├── index.ts           # MCP server entry point, tool registration
├── types.ts           # Shared TypeScript types
├── services/
│   ├── firecrawl.ts   # Firecrawl API client with retry logic
│   └── github.ts      # GitHub API client for backup
└── tools/
    ├── scrape.ts      # Scraping logic
    ├── list.ts        # List docs
    ├── get.ts         # Get doc content
    ├── search.ts      # Full-text search
    └── delete.ts      # Delete docs
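
The README doesn't show the registration code itself. If index.ts follows the standard @modelcontextprotocol/sdk pattern, wiring up a tool such as search_docs looks roughly like this sketch (the handler body is a placeholder, not the project's actual logic):

// Generic MCP tool-registration sketch using the official TypeScript SDK.
// The wiring shown here is illustrative; see src/index.ts for the real code.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "docs-scraper", version: "1.0.0" });

server.tool(
  "search_docs",
  "Full-text search across all scraped docs",
  { query: z.string(), domain: z.string().optional() },
  async ({ query, domain }) => ({
    // Placeholder result; the real handler searches ~/scraped-docs/.
    content: [
      { type: "text", text: `Results for "${query}" in ${domain ?? "all docs"}` },
    ],
  })
);

// Claude Code talks to the server over stdio.
await server.connect(new StdioServerTransport());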

Troubleshooting

"Git Repository is empty" error

Initialize your GitHub repo with at least one commit (a README).

Only 2 pages scraped from SPA

Some sites like Stoplight load content via JavaScript. Use scrape_spa with a higher waitFor value. Note: if the site is a true single-page app with no real URLs, you may only get 1-2 pages regardless.

Rate limited

The Firecrawl free tier has limits. Upgrade or wait for the limit to reset.

License

MIT