cleanweb-mcp

guangxiangdebizi/cleanweb-mcp

CleanWeb MCP is a lightweight Model Context Protocol server that extracts the main content of a web page, cleans it, and converts it to Markdown.

CleanWeb MCP

A lightweight Model Context Protocol (MCP) server that extracts the core content of a web page, filters out ads and irrelevant elements, and converts the result to clean Markdown.

Features

| Smart Extraction | Content Cleaning | Format Conversion | Lightweight Deploy |
|---|---|---|---|
| Axios + Cheerio + Readability | Auto-filter ads & distractions | HTML → Markdown | Zero browser dependency |

Core Advantages

  • Smart Content Extraction: Uses Axios, Cheerio, and the Readability algorithm to extract the main content of a page
  • Intelligent Content Cleaning: Automatically removes ads, navigation, sidebars, and other distracting elements
  • Markdown Conversion: Converts HTML content to clean Markdown
  • Image Link Optimization: Automatically handles overly long image links for better readability
  • Lightweight Deployment: No browser dependencies, so deployment is simple and fast
  • Multiple Output Formats: Supports plain Markdown or JSON with metadata
  • MCP Protocol: Fully compatible with the Model Context Protocol standard
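
The image-link handling listed above could be approximated as follows. This is a minimal sketch, not the project's actual implementation: the function name `shortenImageLinks`, the length limit, and the placeholder text are all assumptions made for illustration.

```typescript
// Hypothetical helper: not taken from the cleanweb-mcp source.
// Replaces any Markdown image whose URL is longer than `maxLen` characters
// (for example an inline base64 data URI) with a short text placeholder.
function shortenImageLinks(markdown: string, maxLen = 100): string {
  // Matches ![alt](url) where the URL contains no whitespace or ")".
  const imagePattern = /!\[([^\]]*)\]\(([^)\s]+)\)/g;
  return markdown.replace(imagePattern, (match: string, alt: string, url: string) => {
    if (url.length <= maxLen) return match; // short links stay untouched
    const label = alt.length > 0 ? alt : "image";
    return `[${label}: link omitted, ${url.length} characters]`;
  });
}

const sample =
  "![chart](https://cdn.example.com/images/2024/very-long-generated-file-name.png) " +
  "and ![ok](https://example.com/a.png)";
// With a limit of 30 characters, only the first image is replaced.
console.log(shortenImageLinks(sample, 30));
```

Replacing the whole image with text, rather than truncating the URL, avoids leaving a broken link in the output; a real implementation might choose differently.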

Tech Stack

TypeScript, Node.js, Axios, Cheerio

Quick Start

Installation

# Install from npm
npm install cleanweb-mcp

# Or clone the repository
git clone https://github.com/guangxiangdebizi/cleanweb-mcp.git
cd cleanweb-mcp
npm install

Note: CleanWeb MCP uses a lightweight HTTP client, so there is no browser to download and deployment is simpler. The project focuses on content cleaning and optimization.

Build Project

npm run build

Usage

1. Stdio Mode (Local Development)

npm run mcp:stdio
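
In stdio mode, an MCP client exchanges JSON-RPC 2.0 messages with the server over stdin and stdout, one JSON object per line. The sketch below builds the `tools/call` request a client would send for this server's `extract_web_content` tool. The helper names `buildToolCall` and `frameForStdio` and the request id are illustrative assumptions; in practice an MCP client library handles this framing, along with the initial `initialize` handshake that is omitted here.

```typescript
// Shape of a JSON-RPC 2.0 request carrying an MCP tool call.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps tool arguments in a tools/call request.
function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport sends each message as a single line of JSON.
function frameForStdio(request: ToolCallRequest): string {
  return JSON.stringify(request) + "\n";
}

const callRequest = buildToolCall(1, "extract_web_content", {
  url: "https://example.com/article",
});
console.log(frameForStdio(callRequest).trim());
```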

2. SSE Mode (via Supergateway)

npm run mcp:sse

The server starts at http://localhost:3100/sse

3. WebSocket Mode

npm run mcp:ws

4. Development Mode (Watch file changes)

npm run mcp:dev

Claude Configuration

Stdio Mode Configuration

Add to Claude's configuration file:

{
  "mcpServers": {
    "cleanweb-mcp": {
      "command": "node",
      "args": ["path/to/your/project/build/index.js"]
    }
  }
}
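
The `path/to/your/project` placeholder above has to point at the compiled entry file. An absolute path is usually the safer choice, because the client may launch the server from a different working directory. As a small sketch, the entry can be generated like this; `buildStdioConfig` is a hypothetical helper, not part of the project.

```typescript
import * as path from "node:path";

// Hypothetical helper: builds the stdio entry shown above with an absolute path.
function buildStdioConfig(projectDir: string) {
  const entry = path.resolve(projectDir, "build", "index.js");
  return {
    mcpServers: {
      "cleanweb-mcp": { command: "node", args: [entry] },
    },
  };
}

// Print JSON that can be pasted into the client's configuration file.
// "." resolves against the directory the script is run from.
console.log(JSON.stringify(buildStdioConfig("."), null, 2));
```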

SSE Mode Configuration

{
  "mcpServers": {
    "cleanweb-mcp-sse": {
      "type": "sse",
      "url": "http://localhost:3100/sse",
      "timeout": 600
    }
  }
}

API Reference

extract_web_content

Extracts the main content of a web page and converts it to Markdown, optionally returned as JSON with metadata.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | The web URL to extract content from |
| format | string | No | markdown | Return format: markdown or json |
| timeout | number | No | 30000 | Page loading timeout (milliseconds) |

Usage Examples

// Basic usage
extract_web_content({
  url: "https://example.com/article"
})

// Advanced usage
extract_web_content({
  url: "https://example.com/article",
  format: "json",
  timeout: 60000
})
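
The parameter table can be expressed as a TypeScript type, with the documented defaults applied before a request is made. The following is a sketch under assumptions: `ExtractArgs` and `normalizeExtractArgs` are hypothetical names, and the http/https restriction is an illustrative check that the source does not confirm.

```typescript
// Argument shape taken from the documented parameters.
type OutputFormat = "markdown" | "json";

interface ExtractArgs {
  url: string;
  format?: OutputFormat;
  timeout?: number; // milliseconds
}

// Hypothetical helper: validates arguments and fills in the documented defaults.
function normalizeExtractArgs(args: ExtractArgs): Required<ExtractArgs> {
  const parsed = new URL(args.url); // throws a TypeError on malformed URLs
  // Assumption: only http(s) pages are fetched.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  const timeout = args.timeout ?? 30000;
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error("timeout must be a positive number of milliseconds");
  }
  return { url: parsed.href, format: args.format ?? "markdown", timeout };
}

// Omitted fields fall back to format "markdown" and a 30000 ms timeout.
console.log(normalizeExtractArgs({ url: "https://example.com/article" }));
```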

Project Structure

cleanweb-mcp/
├── README.md                    # Project documentation
├── package.json                 # Project configuration
├── tsconfig.json                # TypeScript configuration
├── claude-config-example.json   # Claude configuration example
├── example-usage.md             # Usage examples
├── build/                       # Compiled output
│   ├── index.js
│   └── tools/
│       └── web-content-extractor.js
└── src/                         # Source code
    ├── index.ts                 # MCP server main entry
    └── tools/
        └── web-content-extractor.ts # Web content extraction tool

Migration from Express Server

The original Express server (server.js) can still run independently:

npm start

The MCP version provides the same core functionality but integrates with AI assistants through the MCP protocol.

Important Notes

  1. Lightweight Implementation: Uses an HTTP client to fetch static content; no browser dependencies are required
  2. Network Access: Requires network access to the target websites
  3. Static Content: Best suited to static HTML; content rendered dynamically by JavaScript may not be accessible
  4. Timeout Settings: For slow-loading websites, increase the timeout parameter
  5. Content Optimization: Automatically optimizes image link display for better readability
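
Notes 3 and 4 can be illustrated with a plain HTTP fetch that gives up after a timeout. This is a sketch, not the project's code: it uses the `fetch` built into Node.js 18+ in place of Axios, which the project itself uses, so that it has no dependencies. `fetchStaticHtml` and the injectable `fetchImpl` parameter are hypothetical.

```typescript
// Minimal response/fetch shapes so a stub can be injected for testing.
interface TextResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<TextResponse>;

// Hypothetical helper: fetches static HTML, aborting after `timeoutMs`.
// Only the HTML sent by the server is returned; no JavaScript is executed,
// which is why dynamically rendered content may be missing.
async function fetchStaticHtml(
  url: string,
  timeoutMs = 30000,
  fetchImpl: FetchLike = fetch,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, { signal: controller.signal });
    if (!response.ok) {
      throw new Error(`Request failed with HTTP ${response.status}`);
    }
    return await response.text();
  } finally {
    clearTimeout(timer); // always release the timer, on success or failure
  }
}
```

A call such as `fetchStaticHtml("https://example.com/article", 60000)` mirrors the `timeout: 60000` setting for a slow site.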

Contributing

Issues and pull requests are welcome. If you have any questions or suggestions, feel free to contact me.

License

MIT License. See the license file in the repository for details.