MauricioZapata00/250812-html-reader-mcp-server
If you are the rightful owner of 250812-html-reader-mcp-server and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to henry@mcphub.com.
A custom MCP server in Rust for reading HTML or text content from web pages, built with Clean Architecture.
HTML API Reader
A high-performance REST API server written in Rust that fetches and extracts content from web pages. Built using Clean Architecture principles with a workspace structure.
Table of Contents
- Quick Start
- Features
- API Endpoints
- Architecture
- Building
- Running
- Docker Setup
- Project Structure
- Development
- Error Handling
Quick Start
🚀 Using Local Development
-
Build and run:
cargo build --release cargo run --bin html-mcp-reader -
Test the API:
# Health check curl http://localhost:8085/health # Fetch web content curl -X POST http://localhost:8085/api/fetch \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com"}'
🐳 Using Docker
-
Build the Docker image:
docker build -t html-api-reader:latest . -
Run the container:
docker run -p 8085:8085 html-api-reader:latest -
Or use Docker Compose:
docker-compose up
Features
- REST API: Simple HTTP endpoints for web content fetching
- HTML Content Extraction: Extract text content from HTML pages
- Flexible Options: Configure text extraction, redirects, timeouts, and user agents
- Clean Architecture: Separated concerns with domain-driven design
- Async/Await: High-performance async processing with Tokio
- CORS Support: Cross-origin requests enabled
- Health Monitoring: Built-in health check endpoint
- Docker Ready: Containerized deployment with health checks
API Endpoints
GET /health
Returns the health status of the API server.
Response:
{
"status": "healthy",
"version": "0.1.0"
}
POST /api/fetch
Fetches and extracts content from web pages.
Request Body:
{
"url": "https://example.com",
"extract_text_only": true,
"follow_redirects": true,
"timeout_seconds": 30,
"user_agent": "html-api-reader/0.1.0"
}
Parameters:
url(required): The URL to fetch content fromextract_text_only(optional, default: true): Whether to extract only text contentfollow_redirects(optional, default: true): Whether to follow HTTP redirectstimeout_seconds(optional, default: 30, max: 300): Request timeout in secondsuser_agent(optional): Custom User-Agent header
Response:
{
"url": "https://example.com",
"title": "Example Domain",
"text_content": "Example Domain This domain is for use in illustrative examples...",
"raw_html": "<!doctype html><html>...",
"metadata": {
"content_type": "text/html; charset=utf-8",
"status_code": 200,
"content_length": 1256,
"last_modified": null,
"charset": null
}
}
Error Response:
{
"error": "INVALID_URL",
"message": "URL cannot be empty"
}
Architecture
The project follows Clean Architecture principles with these layers:
- Domain: Core business logic and interfaces (
domain/) - Application: Use cases and business services (
application/) - Infrastructure: External adapters for HTTP, HTML parsing, and REST API (
infrastructure/) - Runner: Entry point and dependency injection (
runner/)
Dependencies
Key dependencies used:
axum: Modern web framework for the REST APItower-http: HTTP middleware (CORS support)reqwest: HTTP client for fetching web contentscraper: HTML parsing and text extractionserde/serde_json: JSON serialization for API requests/responsestracing: Structured loggingtokio: Async runtime
Building
Local Development
# Build debug version
cargo build
# Build release version (optimized)
cargo build --release
# Run tests
cargo test
# Check code quality
cargo clippy
cargo fmt
Docker
# Build the Docker image
docker build -t html-api-reader:latest .
# Build with Docker Compose
docker-compose build
Running
Local Development
# Run in development mode
cargo run --bin html-mcp-reader
# Run release version
./target/release/html-mcp-reader
# Run with custom port
PORT=9000 cargo run --bin html-mcp-reader
The server will start on http://0.0.0.0:8085 by default.
Docker
# Run with Docker (recommended for production)
docker run -p 8085:8085 html-api-reader:latest
# Run with Docker Compose
docker-compose up
# Run in background
docker-compose up -d
# View logs
docker-compose logs -f html-api-reader
# Stop
docker-compose down
Environment Variables
PORT: Server port (default: 8085)RUST_LOG: Log level (default: info)RUST_BACKTRACE: Enable backtraces (default: 1)
Docker Setup
Prerequisites
- Docker installed on your system
- Docker Compose (usually included with Docker Desktop)
Configuration
The docker-compose.yaml includes:
- Port Mapping:
8085:8085 - Health Check:
curl http://localhost:8085/health - Resource Limits: 512M memory, 0.5 CPU
- Auto-restart:
unless-stopped - Logging: Structured logs with tracing
Testing with Docker
# Build and start
docker-compose up --build
# Test health endpoint
curl http://localhost:8085/health
# Test content fetching
curl -X POST http://localhost:8085/api/fetch \
-H "Content-Type: application/json" \
-d '{"url": "https://httpbin.org/html"}'
Project Structure
├── Cargo.toml # Workspace configuration
├── domain/ # Core business logic
│ ├── src/
│ │ ├── model/ # Domain models (content, request, response)
│ │ └── port/ # Interfaces for external dependencies
├── application/ # Business logic and use cases
│ ├── src/
│ │ ├── service/ # Application services
│ │ └── use_case/ # Use case implementations
├── infrastructure/ # External adapters
│ ├── src/
│ │ ├── client/ # HTTP client implementation
│ │ ├── adapter/ # HTML parser adapter
│ │ └── api/ # REST API server implementation
└── runner/ # Application entry point
└── src/
└── main.rs # Main application with DI setup
Development
Adding New Features
- Define domain models in
domain/src/model/ - Create interfaces in
domain/src/port/ - Implement business logic in
application/src/service/orapplication/src/use_case/ - Create infrastructure adapters in
infrastructure/src/ - Wire dependencies in
runner/src/main.rs
API Development
To add new endpoints:
- Add request/response models to
domain/src/model/request.rs - Implement business logic in
application/src/use_case/ - Add route handlers in
infrastructure/src/api/server.rs - Update the router in
create_router()method
Testing
# Run all tests
cargo test
# Run tests for specific workspace member
cargo test -p domain
cargo test -p application
cargo test -p infrastructure
# Run specific test
cargo test test_name
# Integration tests with running server
cargo run --bin html-mcp-reader &
SERVER_PID=$!
curl http://localhost:8085/health
kill $SERVER_PID
Error Handling
The API returns appropriate HTTP status codes and error responses:
HTTP Status Codes
200 OK: Successful request400 Bad Request: Invalid request parameters500 Internal Server Error: Server-side errors
Error Response Format
{
"error": "ERROR_CODE",
"message": "Human-readable error description"
}
Common Error Codes
INVALID_URL: Empty or malformed URLFETCH_ERROR: Network, timeout, or HTTP errorsPARSE_ERROR: HTML parsing failures
Logging
The application uses structured logging with different levels:
INFO: Normal operation logsERROR: Error conditionsDEBUG: Detailed debugging information
Configure logging with the RUST_LOG environment variable:
# Info level (default)
RUST_LOG=info cargo run
# Debug level for detailed logs
RUST_LOG=debug cargo run
# Module-specific logging
RUST_LOG=infrastructure::api::server=debug cargo run
Performance
- Async/Await: Non-blocking I/O operations
- Connection Pooling: HTTP client reuses connections
- Memory Efficient: Streaming HTML parsing
- Resource Limits: Configurable timeouts and memory limits
- Docker Optimization: Multi-stage build with minimal runtime image
Security
- Non-root User: Docker container runs as non-root user
- Resource Limits: Memory and CPU limits in Docker Compose
- Input Validation: URL validation and parameter sanitization
- Timeout Protection: Configurable request timeouts
- CORS: Cross-origin request support (configurable)
License
This project is open source and available under the .