cleanweb-mcp

guangxiangdebizi/cleanweb-mcp

CleanWeb MCP is a lightweight Model Context Protocol server designed to extract and clean web content, converting it into a clean Markdown format.

🌐 CleanWeb MCP

A lightweight Model Context Protocol (MCP) server that extracts the core content of a web page, filters out ads and other irrelevant elements, and converts the result to clean Markdown.

✨ Features

| 🌐 Smart Extraction | 🧹 Content Cleaning | 📝 Format Conversion | ⚡ Lightweight Deploy |
|---|---|---|---|
| Axios + Cheerio + Readability | Auto-filter ads & distractions | HTML → Markdown | Zero browser dependency |

🎯 Core Advantages

  • 🌐 Smart Content Extraction: Uses Axios + Cheerio + Readability algorithm to extract main web content
  • 🧹 Intelligent Content Cleaning: Automatically removes ads, navigation, sidebars and other distracting elements
  • 📝 Markdown Conversion: Converts HTML content to clean Markdown format
  • 🖼️ Image Link Optimization: Automatically handles overly long image links for better readability
  • ⚡ Lightweight Deployment: No browser dependencies, simple and fast deployment
  • 🔧 Multiple Output Formats: Supports pure Markdown or JSON format with metadata
  • 🚀 MCP Protocol: Fully compatible with Model Context Protocol standard
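
The image link handling listed above is not specified in detail. As a minimal sketch of one way it could work, the function below drops the query string from image URLs that exceed a length limit and truncates whatever is still too long. The name `shortenImageLinks`, the 80-character limit, and the truncation rule are all assumptions for illustration, not the project's actual implementation.

```typescript
// Hypothetical limit: the project does not document its actual threshold.
const MAX_IMAGE_URL_LENGTH = 80;

/**
 * Shorten overly long URLs inside Markdown image syntax: ![alt](url).
 * Links at or below the limit are returned unchanged.
 */
function shortenImageLinks(
  markdown: string,
  maxLength: number = MAX_IMAGE_URL_LENGTH
): string {
  // Matches ![alt](url) where the URL has no whitespace or closing parenthesis.
  const imagePattern = /!\[([^\]]*)\]\(([^)\s]+)\)/g;

  return markdown.replace(
    imagePattern,
    (match: string, alt: string, url: string): string => {
      if (url.length <= maxLength) {
        return match; // Short links are left untouched.
      }
      // Drop the query string and fragment first; they often carry the bulk.
      const cut = url.search(/[?#]/);
      const bare = cut === -1 ? url : url.slice(0, cut);
      // If the bare URL is still too long, truncate it and mark the cut.
      const shortened =
        bare.length <= maxLength ? bare : bare.slice(0, maxLength) + "...";
      return `![${alt}](${shortened})`;
    }
  );
}
```

A truncated URL no longer resolves, so this approach trades link fidelity for readability; a real implementation may well choose differently.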

🛠️ Tech Stack

TypeScript, Node.js, Axios, Cheerio

🚀 Quick Start

📦 Installation

# Install from npm
npm install cleanweb-mcp

# Or clone the repository
git clone https://github.com/guangxiangdebizi/cleanweb-mcp.git
cd cleanweb-mcp
npm install

💡 Advantage: Because it uses a lightweight HTTP client, no browser download is required, which simplifies deployment. The server focuses on content cleaning and optimization.

🔧 Build Project

npm run build

🎯 Usage

1. Stdio Mode (Local Development)

npm run mcp:stdio

2. SSE Mode (via Supergateway)

npm run mcp:sse

Server will start at http://localhost:3100/sse

3. WebSocket Mode

npm run mcp:ws

4. Development Mode (Watch file changes)

npm run mcp:dev

🛠️ Claude Configuration

Stdio Mode Configuration

Add to Claude's configuration file:

{
  "mcpServers": {
    "cleanweb-mcp": {
      "command": "node",
      "args": ["path/to/your/project/build/index.js"]
    }
  }
}

SSE Mode Configuration

{
  "mcpServers": {
    "cleanweb-mcp-sse": {
      "type": "sse",
      "url": "http://localhost:3100/sse",
      "timeout": 600
    }
  }
}

🔨 API Reference

extract_web_content

Intelligently extract web content and convert to Markdown format.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | The web URL to extract content from |
| format | string | No | markdown | Return format: markdown or json |
| timeout | number | No | 30000 | Page loading timeout (milliseconds) |

Usage Examples

// Basic usage
extract_web_content({
  url: "https://example.com/article"
})

// Advanced usage
extract_web_content({
  url: "https://example.com/article",
  format: "json",
  timeout: 60000
})
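
The examples above use shorthand call syntax. Over the wire, an MCP client wraps a tool invocation in a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `extract_web_content`; the helper `buildExtractRequest` and the two interfaces are hypothetical names introduced for illustration and are not part of this project.

```typescript
// Arguments accepted by extract_web_content, per the parameter list above.
interface ExtractArgs {
  url: string;
  format?: "markdown" | "json";
  timeout?: number; // milliseconds
}

// Shape of an MCP "tools/call" request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: ExtractArgs };
}

// Hypothetical helper: wraps the tool arguments in a JSON-RPC envelope.
function buildExtractRequest(id: number, args: ExtractArgs): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "extract_web_content", arguments: args },
  };
}

// Equivalent of the "advanced usage" call, serialized for transport.
const request = buildExtractRequest(1, {
  url: "https://example.com/article",
  format: "json",
  timeout: 60000,
});
const wire: string = JSON.stringify(request);
```

In practice an MCP client library or the AI assistant itself produces this envelope, so it rarely needs to be written by hand.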

📁 Project Structure

cleanweb-mcp/
├── 📄 README.md                 # Project documentation
├── 📦 package.json              # Project configuration
├── ⚙️ tsconfig.json             # TypeScript configuration
├── 🔧 claude-config-example.json # Claude configuration example
├── 📖 example-usage.md          # Usage examples
├── 🏗️ build/                    # Compiled output
│   ├── index.js
│   └── tools/
│       └── web-content-extractor.js
└── 📝 src/                      # Source code
    ├── index.ts                 # MCP server main entry
    └── tools/
        └── web-content-extractor.ts # Web content extraction tool

🔄 Migration from Express Server

The original Express server (server.js) can still run independently:

npm start

The MCP version provides the same core functionality but integrates with AI assistants through the MCP protocol.

🚨 Important Notes

  1. Lightweight Implementation: Fetches static content with an HTTP client; no browser dependencies are required.
  2. Network Access: The server must be able to reach the target websites.
  3. Static Content: Best suited to static HTML. Content rendered client-side by JavaScript may not be retrievable.
  4. Timeout Settings: For slow-loading websites, increase the timeout parameter (default 30000 ms).
  5. Content Optimization: Overly long image links are handled automatically to keep the output readable.
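
The timeout note and the defaults in the API reference suggest how tool arguments might be normalized before a request is made. The following is a rough sketch under that assumption: `resolveOptions`, `parseFormat`, and the validation rules are hypothetical and may not match what the server actually does.

```typescript
type OutputFormat = "markdown" | "json";

interface ExtractOptions {
  url: string;
  format: OutputFormat;
  timeout: number; // milliseconds
}

// Defaults as documented in the API reference.
const DEFAULT_FORMAT: OutputFormat = "markdown";
const DEFAULT_TIMEOUT_MS = 30000;

// Hypothetical helper: accept only the two documented formats.
function parseFormat(value: string | undefined): OutputFormat {
  if (value === undefined) return DEFAULT_FORMAT;
  if (value === "markdown" || value === "json") return value;
  throw new Error(`unsupported format: ${value}`);
}

// Hypothetical helper: fill in defaults and reject unusable input.
function resolveOptions(input: {
  url?: string;
  format?: string;
  timeout?: number;
}): ExtractOptions {
  const url = input.url;
  if (url === undefined || !/^https?:\/\//i.test(url)) {
    throw new Error("url is required and must start with http:// or https://");
  }
  const timeout = input.timeout ?? DEFAULT_TIMEOUT_MS;
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error("timeout must be a positive number of milliseconds");
  }
  return { url, format: parseFormat(input.format), timeout };
}

// A slow site: keep the default format but double the timeout.
const slowSite = resolveOptions({
  url: "https://example.com/article",
  timeout: 60000,
});
```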

🤝 Contributing

Issues and pull requests are welcome. If you have any questions or suggestions, feel free to get in touch.

📄 License

MIT License. See the license file in the repository for details.