
šŸ”„ Web Scraper API

A comprehensive web scraping API with MCP server integration, SDK, and Docker support. Converts any website into clean, structured data.

🌟 Features

  • šŸ•·ļø Individual Scraping: Extract content from a specific URL
  • šŸ”„ Complete Crawling: Crawl all accessible pages of a website
  • šŸ“ Multiple Formats: Markdown, HTML, screenshots
  • ⚔ Asynchronous: Queue system with Redis for background processing
  • šŸ›”ļø Robust: Error handling, rate limiting and validation
  • šŸŽÆ Customizable: Flexible scraping and filtering options

šŸš€ Quick Start

Prerequisites

  • Node.js 18+
  • Redis
  • npm or pnpm

Installation

  1. Clone and setup:
cd web-scraper-api
npm run install:all
  2. Configure environment variables:
cp api/env.example api/.env
# Edit api/.env as needed
  3. Start Redis:
redis-server
  4. Start the server:
npm run start:dev

The API will be available at http://localhost:3002

šŸ“‹ API Endpoints

System Health

# Check that the API is working
curl http://localhost:3002/health

Individual Scraping

# Basic scrape
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

# Scrape with options
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "includeHtml": true,
      "includeMarkdown": true,
      "includeScreenshot": false,
      "waitFor": 2000,
      "onlyMainContent": true
    }
  }'
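
From code, the same request can be issued with plain fetch. The following is a minimal TypeScript sketch against the documented POST /api/scrape endpoint; it assumes Node.js 18+ (built-in fetch) and makes no assumption about the response shape beyond it being JSON (the bundled SDK provides a higher-level client).

// scrape-example.ts - minimal sketch of calling POST /api/scrape.
// Assumes Node.js 18+ (global fetch) and the API running on localhost:3002.
async function scrapeUrl(url: string): Promise<unknown> {
  const response = await fetch('http://localhost:3002/api/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url,
      options: { includeMarkdown: true, onlyMainContent: true },
    }),
  });
  if (!response.ok) {
    throw new Error(`Scrape failed: ${response.status} ${response.statusText}`);
  }
  // The exact response shape is defined by the API; it is returned as-is here.
  return response.json();
}

scrapeUrl('https://example.com').then(console.log).catch(console.error);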

Batch Scraping

curl -X POST http://localhost:3002/api/scrape/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/contact"
    ],
    "options": {
      "includeMarkdown": true
    }
  }'
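
The batch endpoint takes the same options together with a list of URLs. A short TypeScript sketch, again using plain fetch against the documented endpoint:

// batch-example.ts - sketch of POST /api/scrape/batch with several URLs.
async function scrapeBatch(urls: string[]): Promise<unknown> {
  const response = await fetch('http://localhost:3002/api/scrape/batch', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ urls, options: { includeMarkdown: true } }),
  });
  return response.json();
}

scrapeBatch(['https://example.com', 'https://example.com/about']).then(console.log);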

Crawling

# Start crawl
curl -X POST http://localhost:3002/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "maxPages": 10,
      "maxDepth": 2,
      "includeSubdomains": false
    }
  }'

# Check crawl status
curl http://localhost:3002/api/crawl/{job-id}
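
Because crawls run in the background, a client typically starts a job and polls the status endpoint until it finishes. The sketch below is hedged: the jobId field and the 'completed'/'failed' status values are assumptions about the response shape, not documented here, so adjust them to what the API actually returns.

// crawl-example.ts - sketch of starting a crawl and polling GET /api/crawl/{job-id}.
// The jobId field name and the terminal status values are assumptions.
const API = 'http://localhost:3002';

async function startCrawl(url: string): Promise<string> {
  const res = await fetch(`${API}/api/crawl`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, options: { maxPages: 10, maxDepth: 2 } }),
  });
  const body = (await res.json()) as { jobId?: string; id?: string };
  const jobId = body.jobId ?? body.id;
  if (!jobId) throw new Error('No job id found in the crawl response');
  return jobId;
}

async function waitForCrawl(jobId: string, intervalMs = 3000): Promise<unknown> {
  for (;;) {
    const res = await fetch(`${API}/api/crawl/${jobId}`);
    const status = (await res.json()) as { status?: string };
    if (status.status === 'completed' || status.status === 'failed') return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

startCrawl('https://example.com')
  .then((jobId) => waitForCrawl(jobId))
  .then((result) => console.log(result))
  .catch((err) => console.error(err));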

āš™ļø Configuration Options

Scraping Options

Option            | Type     | Default | Description
includeHtml       | boolean  | false   | Include raw HTML
includeMarkdown   | boolean  | true    | Include content in Markdown
includeScreenshot | boolean  | false   | Include screenshot
waitFor           | number   | 0       | Wait time in ms
timeout           | number   | 30000   | Request timeout in ms
userAgent         | string   | -       | Custom User-Agent
headers           | object   | -       | Custom HTTP headers
excludeSelectors  | string[] | -       | CSS selectors to exclude
onlyMainContent   | boolean  | false   | Only main content
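
For TypeScript consumers these options map naturally onto a type. The interface below is only a sketch that mirrors the table; the SDK may ship its own definitions.

// Sketch of a scrape options type mirroring the table above.
interface ScrapeOptions {
  includeHtml?: boolean;            // default: false
  includeMarkdown?: boolean;        // default: true
  includeScreenshot?: boolean;      // default: false
  waitFor?: number;                 // wait time in ms, default: 0
  timeout?: number;                 // request timeout in ms, default: 30000
  userAgent?: string;               // custom User-Agent
  headers?: Record<string, string>; // custom HTTP headers
  excludeSelectors?: string[];      // CSS selectors to exclude
  onlyMainContent?: boolean;        // default: false
}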

Crawling Options

Includes all scraping options plus:

Option            | Type     | Default | Description
maxPages          | number   | 10      | Maximum pages to crawl
maxDepth          | number   | 3       | Maximum depth
allowedDomains    | string[] | -       | Allowed domains
excludePatterns   | string[] | -       | URL patterns to exclude
includeSubdomains | boolean  | false   | Include subdomains
respectRobotsTxt  | boolean  | false   | Respect robots.txt
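
Continuing the sketch above, the crawl options simply extend the scrape options:

// Sketch continuing ScrapeOptions above with the crawl-specific fields.
interface CrawlOptions extends ScrapeOptions {
  maxPages?: number;           // default: 10
  maxDepth?: number;           // default: 3
  allowedDomains?: string[];   // restrict the crawl to these domains
  excludePatterns?: string[];  // URL patterns to exclude
  includeSubdomains?: boolean; // default: false
  respectRobotsTxt?: boolean;  // default: false
}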

šŸ—ļø Architecture

web-scraper-api/
ā”œā”€ā”€ api/                    # API Server
│   ā”œā”€ā”€ src/
│   │   ā”œā”€ā”€ routes/        # HTTP routes
│   │   ā”œā”€ā”€ services/      # Business logic
│   │   ā”œā”€ā”€ types/         # Type definitions
│   │   ā”œā”€ā”€ utils/         # Utilities
│   │   └── workers/       # Queue workers
│   └── package.json
ā”œā”€ā”€ sdk/                   # TypeScript SDK
│   ā”œā”€ā”€ src/               # API client
│   └── package.json
ā”œā”€ā”€ mcp-server/            # MCP Server for Cursor
│   ā”œā”€ā”€ src/               # MCP server
│   └── package.json
ā”œā”€ā”€ examples/              # Usage examples
ā”œā”€ā”€ docs/                  # Documentation
ā”œā”€ā”€ install-mcp.sh         # MCP installation script
└── start-mcp-server.sh    # MCP startup script

šŸ› ļø Development

Available Scripts

# Install all dependencies
npm run install:all

# Development (with hot reload)
npm run start:dev

# Production
npm run build
npm start

# Tests
npm test

# Start workers
cd api && npm run workers

Development Configuration

The api/.env file should include:

PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://localhost:6379
NUM_WORKERS=4
PUPPETEER_HEADLESS=true
LOG_LEVEL=info
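
Inside the API these variables can be read with fallbacks matching the defaults above. A minimal sketch (the real config module may load them differently, e.g. via dotenv):

// config-example.ts - sketch of reading the documented environment variables.
export const config = {
  port: Number(process.env.PORT ?? 3002),
  host: process.env.HOST ?? '0.0.0.0',
  redisUrl: process.env.REDIS_URL ?? 'redis://localhost:6379',
  numWorkers: Number(process.env.NUM_WORKERS ?? 4),
  puppeteerHeadless: (process.env.PUPPETEER_HEADLESS ?? 'true') === 'true',
  logLevel: process.env.LOG_LEVEL ?? 'info',
};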

šŸ¤– MCP Server for Cursor

You can now use the Web Scraper API directly in Cursor with our MCP server!

Quick Installation

# Install everything automatically
./install-mcp.sh

# Start server
./start-mcp-server.sh

Cursor Configuration

Add this to your Cursor settings.json:

{
  "mcp.servers": {
    "web-scraper-api": {
      "command": "node",
      "args": ["/full/path/to/web-scraper-api/mcp-server/dist/index.js"],
      "env": {
        "WEB_SCRAPER_API_URL": "http://localhost:3002"
      }
    }
  }
}

Usage in Cursor

Once configured, you can ask Cursor to scrape or crawl URLs directly from its chat, using the tools exposed by the MCP server.

Complete documentation is available in the docs/ directory.

šŸš€ Deployment

With Docker

# Coming soon - Docker Compose
docker-compose up

Manual

  1. Configure Redis in production
  2. Set environment variables
  3. Build the project: npm run build
  4. Start server: npm start
  5. Start workers: npm run workers

šŸ”§ Technologies

  • Backend: Node.js, TypeScript, Express
  • Web Scraping: Puppeteer, Cheerio
  • Queues: BullMQ + Redis
  • Processing: TurndownService (HTML → Markdown); see the sketch after this list
  • Security: Helmet, CORS, Rate Limiting
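
As an illustration of the HTML → Markdown step, Turndown converts an HTML string in a couple of lines. This is a standalone sketch of the library, independent of this project's service layer:

// turndown-example.ts - standalone sketch of the HTML to Markdown conversion.
// Requires: npm install turndown (plus @types/turndown for TypeScript).
import TurndownService from 'turndown';

const turndown = new TurndownService({ headingStyle: 'atx' });
const markdown = turndown.turndown('<h1>Example</h1><p>Converted to <strong>Markdown</strong>.</p>');
console.log(markdown);
// # Example
//
// Converted to **Markdown**.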

šŸ“ Roadmap

  • JavaScript/TypeScript SDK
  • Python SDK
  • Web administration interface
  • Authentication support
  • Webhook notifications
  • Docker containers
  • Metrics and monitoring
  • Smart caching
  • Proxy support

šŸ¤ Contributing

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Create Pull Request

šŸ“„ License

MIT License - see the LICENSE file for more details.

šŸ™ Inspiration

This project is inspired by Firecrawl and serves as a simplified proof of concept for understanding web scraping at scale.