jotape4ai/web-scraper-api
🔥 Web Scraper API
A comprehensive web scraping API with MCP server integration, SDK, and Docker support. Converts any website into clean, structured data.
🌟 Features
- 🕷️ Individual Scraping: Extract content from a specific URL
- 🔄 Complete Crawling: Crawl all accessible pages of a website
- 📝 Multiple Formats: Markdown, HTML, screenshots
- ⚡ Asynchronous: Queue system with Redis for background processing
- 🛡️ Robust: Error handling, rate limiting and validation
- 🎯 Customizable: Flexible scraping and filtering options
🚀 Quick Start
Prerequisites
- Node.js 18+
- Redis
- npm or pnpm
Installation
- Clone the repository and set it up:
cd web-scraper-api
npm run install:all
- Configure environment variables:
cp api/env.example api/.env
# Edit api/.env as needed
- Start Redis:
redis-server
- Start the server:
npm run start:dev
The API will be available at http://localhost:3002
📋 API Endpoints
System Health
# Check that the API is working
curl http://localhost:3002/health
Individual Scraping
# Basic scrape
curl -X POST http://localhost:3002/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com"
}'
# Scrape with options
curl -X POST http://localhost:3002/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"options": {
"includeHtml": true,
"includeMarkdown": true,
"includeScreenshot": false,
"waitFor": 2000,
"onlyMainContent": true
}
}'
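The same request can be issued programmatically. Below is a minimal TypeScript sketch using Node 18's built-in fetch; the shape of the JSON response is an assumption, so check the actual types in api/src/types.

```typescript
// scrape.ts — minimal client sketch for POST /api/scrape.
// Assumes the API is running locally on port 3002; the response
// shape is not specified here, so treat it as illustrative.
async function scrape(url: string): Promise<unknown> {
  const res = await fetch('http://localhost:3002/api/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url,
      options: { includeMarkdown: true, onlyMainContent: true },
    }),
  });
  if (!res.ok) throw new Error(`Scrape failed with HTTP ${res.status}`);
  return res.json();
}

scrape('https://example.com').then((data) => console.log(data));
```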
Batch Scraping
curl -X POST http://localhost:3002/api/scrape/batch \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com",
"https://example.com/about",
"https://example.com/contact"
],
"options": {
"includeMarkdown": true
}
}'
Crawling
# Start crawl
curl -X POST http://localhost:3002/api/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"options": {
"maxPages": 10,
"maxDepth": 2,
"includeSubdomains": false
}
}'
# Check crawl status
curl http://localhost:3002/api/crawl/{job-id}
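Because crawls run asynchronously through the Redis queue, the POST returns a job id that you then poll. A sketch of that loop follows; `jobId`, `status`, and the status strings are assumed field names, so adjust them to the API's actual response types.

```typescript
// crawl-poll.ts — start a crawl, then poll the job until it settles.
// Field names below are assumptions about the response shape.
async function crawlAndWait(url: string): Promise<void> {
  const started = await fetch('http://localhost:3002/api/crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, options: { maxPages: 10, maxDepth: 2 } }),
  });
  const { jobId } = (await started.json()) as { jobId: string };

  let status = 'queued';
  while (status === 'queued' || status === 'processing') {
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2 s
    const res = await fetch(`http://localhost:3002/api/crawl/${jobId}`);
    ({ status } = (await res.json()) as { status: string });
    console.log('crawl status:', status);
  }
}

crawlAndWait('https://example.com');
```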
⚙️ Configuration Options
Scraping Options
| Option | Type | Default | Description |
|---|---|---|---|
| includeHtml | boolean | false | Include raw HTML |
| includeMarkdown | boolean | true | Include content as Markdown |
| includeScreenshot | boolean | false | Include a screenshot |
| waitFor | number | 0 | Wait time in ms |
| timeout | number | 30000 | Request timeout in ms |
| userAgent | string | - | Custom User-Agent |
| headers | object | - | Custom HTTP headers |
| excludeSelectors | string[] | - | CSS selectors to exclude |
| onlyMainContent | boolean | false | Extract only the page's main content |
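The table maps directly onto an options type. Here is a sketch of what it might look like in TypeScript; the project's canonical definitions live in api/src/types, so the field names there take precedence.

```typescript
// Sketch of the scraping options, mirroring the table above.
interface ScrapeOptions {
  includeHtml?: boolean;            // default: false
  includeMarkdown?: boolean;        // default: true
  includeScreenshot?: boolean;      // default: false
  waitFor?: number;                 // wait time in ms, default: 0
  timeout?: number;                 // request timeout in ms, default: 30000
  userAgent?: string;               // custom User-Agent
  headers?: Record<string, string>; // custom HTTP headers
  excludeSelectors?: string[];      // CSS selectors to exclude
  onlyMainContent?: boolean;        // default: false
}
```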
Crawling Options
Includes all scraping options plus:
| Option | Type | Default | Description |
|---|---|---|---|
| maxPages | number | 10 | Maximum number of pages to crawl |
| maxDepth | number | 3 | Maximum crawl depth |
| allowedDomains | string[] | - | Domains the crawler may visit |
| excludePatterns | string[] | - | URL patterns to exclude |
| includeSubdomains | boolean | false | Include subdomains |
| respectRobotsTxt | boolean | false | Respect robots.txt |
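Likewise for the crawl options, extending the scrape options sketch above:

```typescript
// Sketch of the crawl options; inherits every field of ScrapeOptions.
interface CrawlOptions extends ScrapeOptions {
  maxPages?: number;           // default: 10
  maxDepth?: number;           // default: 3
  allowedDomains?: string[];   // domains the crawler may visit
  excludePatterns?: string[];  // URL patterns to exclude
  includeSubdomains?: boolean; // default: false
  respectRobotsTxt?: boolean;  // default: false
}
```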
🏗️ Architecture
web-scraper-api/
├── api/ # API Server
│ ├── src/
│ │ ├── routes/ # HTTP routes
│ │ ├── services/ # Business logic
│ │ ├── types/ # Type definitions
│ │ ├── utils/ # Utilities
│ │ └── workers/ # Queue workers
│ └── package.json
├── sdk/ # TypeScript SDK
│ ├── src/ # API client
│ └── package.json
├── mcp-server/ # MCP Server for Cursor
│ ├── src/ # MCP server
│ └── package.json
├── examples/ # Usage examples
├── docs/ # Documentation
├── install-mcp.sh # MCP installation script
└── start-mcp-server.sh # MCP startup script
🛠️ Development
Available Scripts
# Install all dependencies
npm run install:all
# Development (with hot reload)
npm run start:dev
# Production
npm run build
npm start
# Tests
npm test
# Start workers
cd api && npm run workers
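The workers pull jobs from the Redis-backed queue (BullMQ, see Technologies below). The real implementations live in api/src/workers; the following is only a minimal sketch, and the queue name and job payload are assumptions.

```typescript
// worker.ts — minimal BullMQ worker sketch (not the project's code).
// The queue name 'scrape' and the job payload shape are assumptions.
import { Worker } from 'bullmq';

const worker = new Worker(
  'scrape',
  async (job) => {
    const { url } = job.data as { url: string };
    console.log(`Processing ${url} (job ${job.id})`);
    // ...load the page with Puppeteer, convert HTML with Turndown, etc.
  },
  { connection: { host: 'localhost', port: 6379 } } // matches REDIS_URL
);

worker.on('completed', (job) => console.log(`Job ${job.id} completed`));
worker.on('failed', (job, err) => console.error(`Job ${job?.id} failed`, err));
```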
Development Configuration
The api/.env file should include:
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://localhost:6379
NUM_WORKERS=4
PUPPETEER_HEADLESS=true
LOG_LEVEL=info
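A sketch of how these variables might be read at startup, using dotenv (an assumption; the project's actual loading code may differ):

```typescript
// config.ts — sketch of loading the settings above with dotenv.
import 'dotenv/config'; // reads api/.env when run from the api/ directory

export const config = {
  port: Number(process.env.PORT ?? 3002),
  host: process.env.HOST ?? '0.0.0.0',
  redisUrl: process.env.REDIS_URL ?? 'redis://localhost:6379',
  numWorkers: Number(process.env.NUM_WORKERS ?? 4),
  puppeteerHeadless: process.env.PUPPETEER_HEADLESS !== 'false',
  logLevel: process.env.LOG_LEVEL ?? 'info',
};
```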
🤖 MCP Server for Cursor
Now you can use the Web Scraper API directly in Cursor with our MCP server!
Quick Installation
# Install everything automatically
./install-mcp.sh
# Start server
./start-mcp-server.sh
Cursor Configuration
Add this to your Cursor settings.json:
{
"mcp.servers": {
"web-scraper-api": {
"command": "node",
"args": ["/full/path/to/web-scraper-api/mcp-server/dist/index.js"],
"env": {
"WEB_SCRAPER_API_URL": "http://localhost:3002"
}
}
}
}
Usage in Cursor
Once configured, you can use commands like:
- "Extract content from https://example.com"
- "Scrape these URLs in batch: [urls]"
- "Crawl the entire website https://blog.example.com"
Complete documentation lives in the docs/ directory.
🚀 Deployment
With Docker
# Coming soon - Docker Compose
docker-compose up
Manual
- Configure Redis in production
- Set environment variables
- Build the project:
npm run build
- Start the server:
npm start
- Start the workers:
npm run workers
🔧 Technologies
- Backend: Node.js, TypeScript, Express
- Web Scraping: Puppeteer, Cheerio
- Queues: BullMQ + Redis
- Processing: TurndownService (HTML → Markdown; see the example after this list)
- Security: Helmet, CORS, Rate Limiting
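For a feel of the HTML → Markdown step, here is a standalone TurndownService example (illustrative only, not the project's pipeline code):

```typescript
// html-to-markdown.ts — standalone Turndown example.
import TurndownService from 'turndown';

const turndown = new TurndownService({ headingStyle: 'atx' });
const markdown = turndown.turndown(
  '<h1>Hello</h1><p>Clean, <strong>structured</strong> data.</p>'
);
console.log(markdown);
// # Hello
//
// Clean, **structured** data.
```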
📝 Roadmap
- JavaScript/TypeScript SDK
- Python SDK
- Web administration interface
- Authentication support
- Webhook notifications
- Docker containers
- Metrics and monitoring
- Smart caching
- Proxy support
🤝 Contributing
- Fork the project
- Create a feature branch (git checkout -b feature/new-feature)
- Commit changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/new-feature)
- Create a Pull Request
📄 License
MIT License - see the LICENSE file for more details.
🙏 Inspiration
This project is inspired by Firecrawl and serves as a simplified proof of concept for understanding web scraping at scale.