# Web Scraper API

A comprehensive web scraping API with MCP server integration, SDK, and Docker support. Converts any website into clean, structured data.
## Features

- **Individual scraping**: extract content from a single URL
- **Full crawling**: crawl every accessible page of a website
- **Multiple formats**: Markdown, HTML, screenshots
- **Asynchronous**: Redis-backed queue system for background processing
- **Robust**: error handling, rate limiting, and input validation
- **Customizable**: flexible scraping and filtering options
## Quick Start

### Prerequisites

- Node.js 18+
- Redis
- npm or pnpm
### Installation

- Clone and set up:

  ```bash
  git clone https://github.com/jotape4ai/web-scraper-api.git
  cd web-scraper-api
  npm run install:all
  ```

- Configure environment variables:

  ```bash
  cp api/env.example api/.env
  # Edit api/.env as needed
  ```

- Start Redis (or use the Docker alternative below):

  ```bash
  redis-server
  ```

- Start the server:

  ```bash
  npm run start:dev
  ```

The API will be available at http://localhost:3002.
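If you prefer not to install Redis locally, running it in a container works just as well; this assumes Docker is available and matches the default `REDIS_URL` (`redis://localhost:6379`):

```bash
# Run Redis in a container instead of a local install
docker run --name web-scraper-redis -d -p 6379:6379 redis:7-alpine
```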
## API Endpoints

### System Health

```bash
# Check that the API is working
curl http://localhost:3002/health
```
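In scripts it is often useful to block until the API is reachable. A minimal sketch built on the same health endpoint:

```bash
# Poll /health until the API answers with a success status
until curl -sf http://localhost:3002/health > /dev/null; do
  echo "waiting for the API..."
  sleep 1
done
echo "API is up"
```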
### Individual Scraping

```bash
# Basic scrape
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```
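To use the result in a shell pipeline, pipe the response through `jq`. The field path below is an assumption about the response shape (Markdown should be present by default, since `includeMarkdown` defaults to `true`), so adjust it to the actual payload:

```bash
# Extract the Markdown from the response (the ".data.markdown" path is assumed)
curl -s -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq -r '.data.markdown // .markdown'
```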
```bash
# Scrape with options
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "includeHtml": true,
      "includeMarkdown": true,
      "includeScreenshot": false,
      "waitFor": 2000,
      "onlyMainContent": true
    }
  }'
```
### Batch Scraping

```bash
curl -X POST http://localhost:3002/api/scrape/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/contact"
    ],
    "options": {
      "includeMarkdown": true
    }
  }'
```
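If the URL list lives in a file, the request body can be assembled on the fly; a sketch assuming `jq` is installed and `urls.txt` holds one URL per line:

```bash
# Build {"urls": [...], "options": {...}} from urls.txt and POST it
jq -Rn '{urls: [inputs], options: {includeMarkdown: true}}' < urls.txt |
  curl -X POST http://localhost:3002/api/scrape/batch \
    -H "Content-Type: application/json" \
    -d @-
```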
### Crawling

```bash
# Start a crawl
curl -X POST http://localhost:3002/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "maxPages": 10,
      "maxDepth": 2,
      "includeSubdomains": false
    }
  }'

# Check crawl status
curl http://localhost:3002/api/crawl/{job-id}
```
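Because crawls run in the background, the status endpoint is typically polled until the job reaches a terminal state. A sketch of such a loop; the `.status` field and its `completed`/`failed` values are assumptions about the response shape, so adapt them to the real payload:

```bash
JOB_ID="<job-id>"  # placeholder: use the id returned by POST /api/crawl

# Poll every 5 seconds until the job reports a terminal state.
# NOTE: the ".status" field and its values are assumed, not documented.
while true; do
  STATUS=$(curl -s "http://localhost:3002/api/crawl/$JOB_ID" | jq -r '.status')
  echo "status: $STATUS"
  case "$STATUS" in
    completed|failed) break ;;
  esac
  sleep 5
done
```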
## Configuration Options

### Scraping Options

| Option | Type | Default | Description |
|---|---|---|---|
| `includeHtml` | boolean | `false` | Include raw HTML |
| `includeMarkdown` | boolean | `true` | Include content as Markdown |
| `includeScreenshot` | boolean | `false` | Include a screenshot |
| `waitFor` | number | `0` | Wait time in ms |
| `timeout` | number | `30000` | Request timeout in ms |
| `userAgent` | string | - | Custom User-Agent |
| `headers` | object | - | Custom HTTP headers |
| `excludeSelectors` | string[] | - | CSS selectors to exclude |
| `onlyMainContent` | boolean | `false` | Return only the main content |
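For example, a request that combines several of these options: a custom User-Agent, a tighter timeout, extra headers, and `excludeSelectors` to strip page chrome (the selector values are only illustrative):

```bash
curl -X POST http://localhost:3002/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "userAgent": "my-scraper/1.0",
      "timeout": 15000,
      "headers": { "Accept-Language": "en-US" },
      "excludeSelectors": ["nav", "footer", ".cookie-banner"],
      "onlyMainContent": true
    }
  }'
```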
### Crawling Options

Includes all scraping options, plus:

| Option | Type | Default | Description |
|---|---|---|---|
| `maxPages` | number | `10` | Maximum pages to crawl |
| `maxDepth` | number | `3` | Maximum crawl depth |
| `allowedDomains` | string[] | - | Domains allowed to be crawled |
| `excludePatterns` | string[] | - | URL patterns to exclude |
| `includeSubdomains` | boolean | `false` | Include subdomains |
| `respectRobotsTxt` | boolean | `false` | Respect robots.txt |
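Putting these together, a crawl that stays on a single domain, skips some URL patterns, and honors robots.txt might look like this (the pattern strings are illustrative):

```bash
curl -X POST http://localhost:3002/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "maxPages": 50,
      "maxDepth": 3,
      "allowedDomains": ["example.com"],
      "excludePatterns": ["/login", "/admin"],
      "respectRobotsTxt": true
    }
  }'
```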
## Architecture

```
web-scraper-api/
├── api/                  # API server
│   ├── src/
│   │   ├── routes/       # HTTP routes
│   │   ├── services/     # Business logic
│   │   ├── types/        # Type definitions
│   │   ├── utils/        # Utilities
│   │   └── workers/      # Queue workers
│   └── package.json
├── sdk/                  # TypeScript SDK
│   ├── src/              # API client
│   └── package.json
├── mcp-server/           # MCP server for Cursor
│   ├── src/              # MCP server
│   └── package.json
├── examples/             # Usage examples
├── docs/                 # Documentation
├── install-mcp.sh        # MCP installation script
└── start-mcp-server.sh   # MCP startup script
```
## Development

### Available Scripts

```bash
# Install all dependencies
npm run install:all

# Development (with hot reload)
npm run start:dev

# Production
npm run build
npm start

# Tests
npm test

# Start the workers
cd api && npm run workers
```
### Development Configuration

The `api/.env` file should include:

```bash
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://localhost:6379
NUM_WORKERS=4
PUPPETEER_HEADLESS=true
LOG_LEVEL=info
```
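These values can also be overridden per invocation without editing the file, assuming the usual dotenv precedence where real environment variables win over `.env` entries:

```bash
# One-off override for a single run
PORT=3003 NUM_WORKERS=8 npm run start:dev
```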
## MCP Server for Cursor

You can use the Web Scraper API directly in Cursor with the bundled MCP server.

### Quick Installation

```bash
# Install everything automatically
./install-mcp.sh

# Start the server
./start-mcp-server.sh
```

### Cursor Configuration

Add this to your Cursor `settings.json`:

```json
{
  "mcp.servers": {
    "web-scraper-api": {
      "command": "node",
      "args": ["/full/path/to/web-scraper-api/mcp-server/dist/index.js"],
      "env": {
        "WEB_SCRAPER_API_URL": "http://localhost:3002"
      }
    }
  }
}
```
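To verify that the MCP server starts at all outside Cursor, you can run the same command from the configuration by hand:

```bash
# Same command, args, and env as the Cursor config above
WEB_SCRAPER_API_URL=http://localhost:3002 \
  node /full/path/to/web-scraper-api/mcp-server/dist/index.js
```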
### Usage in Cursor

Once configured, you can use prompts like:

- "Extract content from https://example.com"
- "Scrape these URLs in batch: [urls]"
- "Crawl the entire website https://blog.example.com"

Complete documentation is available in the `docs/` directory.
## Deployment

### With Docker

```bash
# Coming soon - Docker Compose
docker-compose up
```

### Manual

- Configure Redis in production
- Set the environment variables
- Build the project:

  ```bash
  npm run build
  ```

- Start the server:

  ```bash
  npm start
  ```

- Start the workers:

  ```bash
  npm run workers
  ```
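For a simple non-container production setup, a process manager can keep the server and the workers alive. A sketch using pm2, which is an assumption rather than something the project prescribes:

```bash
# pm2 is not bundled with the project; install it globally first
npm install -g pm2

# Keep the API server and the queue workers running
pm2 start npm --name scraper-api -- start
pm2 start npm --name scraper-workers --cwd api -- run workers
pm2 save
```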
## Technologies

- **Backend**: Node.js, TypeScript, Express
- **Web scraping**: Puppeteer, Cheerio
- **Queues**: BullMQ + Redis
- **Processing**: TurndownService (HTML → Markdown)
- **Security**: Helmet, CORS, rate limiting
## Roadmap

- JavaScript/TypeScript SDK
- Python SDK
- Web administration interface
- Authentication support
- Webhook notifications
- Docker containers
- Metrics and monitoring
- Smart caching
- Proxy support
## Contributing

- Fork the project
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
## License

MIT License - see the `LICENSE` file for details.
## Inspiration

This project is inspired by Firecrawl and serves as a simplified proof of concept for understanding web scraping at scale.