🔥 Web Scraper API

A comprehensive web scraping API with MCP server integration, SDK, and Docker support. Converts any website into clean, structured data.

🌟 Features

  • 🕷️ Individual Scraping: Extract content from a specific URL
  • 🔄 Complete Crawling: Crawl all accessible pages of a website
  • 📝 Multiple Formats: Markdown, HTML, screenshots
  • ⚡ Asynchronous: Queue system with Redis for background processing
  • 🛡️ Robust: Error handling, rate limiting and validation
  • 🎯 Customizable: Flexible scraping and filtering options

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • Redis
  • npm or pnpm

Installation

  1. Clone the repository and install all dependencies:
cd web-scraper-api
npm run install:all
  2. Configure environment variables:
cp api/env.example api/.env
# Edit api/.env as needed
  3. Start Redis:
redis-server
  4. Start the server:
npm run start:dev

The API will be available at http://localhost:3002

📋 API Endpoints

System Health

# Check that the API is working
curl http://localhost:3002/health

Individual Scraping

# Basic scrape
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

# Scrape with options
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "includeHtml": true,
      "includeMarkdown": true,
      "includeScreenshot": false,
      "waitFor": 2000,
      "onlyMainContent": true
    }
  }'
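
If you prefer calling the endpoint from code, the following TypeScript sketch posts the same request using Node 18's built-in fetch. The request body mirrors the curl example above; the shape of the JSON response is not documented here, so the result handling is an assumption.

// Minimal sketch of POST /api/scrape using Node 18+ global fetch.
// The response shape is an assumption; adjust it to the actual API output.
async function scrape(url: string) {
  const res = await fetch("http://localhost:3002/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url,
      options: { includeMarkdown: true, onlyMainContent: true },
    }),
  });

  if (!res.ok) {
    throw new Error(`Scrape failed: ${res.status} ${res.statusText}`);
  }
  return res.json(); // assumed to contain the scraped Markdown/HTML
}

scrape("https://example.com").then((result) => console.log(result));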

Batch Scraping

curl -X POST http://localhost:3002/api/scrape/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/contact"
    ],
    "options": {
      "includeMarkdown": true
    }
  }'

Crawling

# Start crawl
curl -X POST http://localhost:3002/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "maxPages": 10,
      "maxDepth": 2,
      "includeSubdomains": false
    }
  }'

# Check crawl status
curl http://localhost:3002/api/crawl/{job-id}
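
Because crawls run asynchronously through the Redis-backed queue, clients typically poll the status endpoint until the job finishes. The TypeScript sketch below assumes the status response exposes a status field (with values such as "completed" or "failed") and a data field holding the results; these names are assumptions, not part of the documented API.

// Hypothetical polling loop for an async crawl job (Node 18+).
// Field names (`status`, `data`) and status values are assumptions.
async function waitForCrawl(jobId: string, intervalMs = 2000) {
  while (true) {
    const res = await fetch(`http://localhost:3002/api/crawl/${jobId}`);
    if (!res.ok) throw new Error(`Status check failed: ${res.status}`);

    const job = await res.json();
    if (job.status === "completed") return job.data; // assumed field names
    if (job.status === "failed") throw new Error("Crawl failed");

    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}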

⚙️ Configuration Options

Scraping Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| includeHtml | boolean | false | Include raw HTML |
| includeMarkdown | boolean | true | Include content in Markdown |
| includeScreenshot | boolean | false | Include screenshot |
| waitFor | number | 0 | Wait time in ms |
| timeout | number | 30000 | Request timeout |
| userAgent | string | - | Custom User-Agent |
| headers | object | - | Custom HTTP headers |
| excludeSelectors | string[] | - | CSS selectors to exclude |
| onlyMainContent | boolean | false | Only main content |
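
For reference, the options in the table above map onto a request object shaped roughly like the TypeScript interface below. The interface is illustrative only and is not taken from the project's SDK.

// Illustrative shape of the scraping options; names mirror the table above,
// but the interface itself is a sketch, not the SDK's actual type.
interface ScrapeOptions {
  includeHtml?: boolean;            // default: false
  includeMarkdown?: boolean;        // default: true
  includeScreenshot?: boolean;      // default: false
  waitFor?: number;                 // wait time in ms, default: 0
  timeout?: number;                 // request timeout, default: 30000
  userAgent?: string;               // custom User-Agent
  headers?: Record<string, string>; // custom HTTP headers
  excludeSelectors?: string[];      // CSS selectors to exclude
  onlyMainContent?: boolean;        // default: false
}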

Crawling Options

Includes all scraping options plus:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxPages | number | 10 | Maximum pages to crawl |
| maxDepth | number | 3 | Maximum depth |
| allowedDomains | string[] | - | Allowed domains |
| excludePatterns | string[] | - | URL patterns to exclude |
| includeSubdomains | boolean | false | Include subdomains |
| respectRobotsTxt | boolean | false | Respect robots.txt |
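
Since crawling accepts every scraping option plus the fields above, the combined shape can be pictured as an extension of the previous sketch (again illustrative, not the SDK's actual type):

// Illustrative: crawl options extend the scrape options sketched earlier.
interface CrawlOptions extends ScrapeOptions {
  maxPages?: number;           // default: 10
  maxDepth?: number;           // default: 3
  allowedDomains?: string[];   // restrict the crawl to these domains
  excludePatterns?: string[];  // URL patterns to skip
  includeSubdomains?: boolean; // default: false
  respectRobotsTxt?: boolean;  // default: false
}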

🏗️ Architecture

web-scraper-api/
├── api/                    # API Server
│   ├── src/
│   │   ├── routes/        # HTTP routes
│   │   ├── services/      # Business logic
│   │   ├── types/         # Type definitions
│   │   ├── utils/         # Utilities
│   │   └── workers/       # Queue workers
│   └── package.json
├── sdk/                   # TypeScript SDK
│   ├── src/               # API client
│   └── package.json
├── mcp-server/            # MCP Server for Cursor
│   ├── src/               # MCP server
│   └── package.json
├── examples/              # Usage examples
├── docs/                  # Documentation
├── install-mcp.sh         # MCP installation script
└── start-mcp-server.sh    # MCP startup script

🛠️ Development

Available Scripts

# Install all dependencies
npm run install:all

# Development (with hot reload)
npm run start:dev

# Production
npm run build
npm start

# Tests
npm test

# Start workers
cd api && npm run workers

Development Configuration

The api/.env file should include:

PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://localhost:6379
NUM_WORKERS=4
PUPPETEER_HEADLESS=true
LOG_LEVEL=info
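
As a rough illustration, the server could read these variables into a typed config object along the lines below; how the project actually loads its configuration is an assumption, and the fallback values simply mirror the example settings above.

// Sketch only: reading the environment variables listed above in Node/TypeScript.
// The real configuration loading inside the API may differ.
const config = {
  port: Number(process.env.PORT ?? 3002),
  host: process.env.HOST ?? "0.0.0.0",
  redisUrl: process.env.REDIS_URL ?? "redis://localhost:6379",
  numWorkers: Number(process.env.NUM_WORKERS ?? 4),
  puppeteerHeadless: (process.env.PUPPETEER_HEADLESS ?? "true") === "true",
  logLevel: process.env.LOG_LEVEL ?? "info",
};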

🤖 MCP Server for Cursor

You can now use the Web Scraper API directly in Cursor with our MCP server!

Quick Installation

# Install everything automatically
./install-mcp.sh

# Start server
./start-mcp-server.sh

Cursor Configuration

Add this to your Cursor settings.json:

{
  "mcp.servers": {
    "web-scraper-api": {
      "command": "node",
      "args": ["/full/path/to/web-scraper-api/mcp-server/dist/index.js"],
      "env": {
        "WEB_SCRAPER_API_URL": "http://localhost:3002"
      }
    }
  }
}

Usage in Cursor

Once configured, you can invoke the Web Scraper API's scraping and crawling tools directly from Cursor chat.

Complete documentation is available in the docs/ directory.

🚀 Deployment

With Docker

# Coming soon - Docker Compose
docker-compose up

Manual

  1. Configure Redis in production
  2. Set environment variables
  3. Build the project: npm run build
  4. Start server: npm start
  5. Start workers: npm run workers

🔧 Technologies

  • Backend: Node.js, TypeScript, Express
  • Web Scraping: Puppeteer, Cheerio
  • Queues: BullMQ + Redis
  • Processing: TurndownService (HTML → Markdown)
  • Security: Helmet, CORS, Rate Limiting

📝 Roadmap

  • JavaScript/TypeScript SDK
  • Python SDK
  • Web administration interface
  • Authentication support
  • Webhook notifications
  • Docker containers
  • Metrics and monitoring
  • Smart caching
  • Proxy support

🤝 Contributing

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Create Pull Request

📄 License

MIT License - see the LICENSE file for more details.

🙏 Inspiration

This project is inspired by Firecrawl and serves as a simplified proof of concept for exploring web scraping at scale.