mcp-web by geehexx - MCP Server

mcp-web

MCP Server for Intelligent Web Summarization

mcp-web is a Model Context Protocol (MCP) server that summarizes web pages: it fetches URLs, extracts the main content, splits it into chunks, and summarizes the result with an LLM.

Features

🚀 Fast & Robust Fetching: httpx primary with Playwright fallback for JS-heavy sites
🎯 Intelligent Content Extraction: trafilatura-powered main content extraction
📊 Smart Chunking: Hierarchical and semantic text splitting with configurable overlap
🤖 LLM Summarization: Map-reduce strategy for long documents with streaming output
🏠 Local LLM Support: Ollama, LM Studio, LocalAI, or cloud providers (OpenAI, Anthropic)
💾 Persistent Caching: Disk-based cache with TTL and LRU eviction
📈 Metrics & Logging: Comprehensive observability with structured logging
🔗 Link Following: Optional recursive link following for deeper context
📝 Markdown Output: Well-formatted summaries with citations and metadata
🧪 Comprehensive Testing: Unit, integration, security, golden, and benchmark tests

Quick Start

Installation

```bash
# Clone the repository
git clone https://github.com/geehexx/mcp-web.git
cd mcp-web

# Install Taskfile (recommended)
# macOS: brew install go-task/tap/go-task
# Linux: snap install task --classic
# Or see: https://taskfile.dev/installation/

# Setup complete environment (recommended)
task dev:setup

# Or manual installation
pip install -e ".[dev]"
playwright install chromium
```

### Configuration

#### Cloud LLM (OpenAI)

```bash
export OPENAI_API_KEY="sk-..."
export MCP_WEB_SUMMARIZER_PROVIDER=openai
export MCP_WEB_SUMMARIZER_MODEL=gpt-4o-mini
```

Local LLM (Ollama - Recommended)

# Install Ollama: https://ollama.com
# Start: ollama serve
# Pull model: ollama pull llama3.2:3b

export MCP_WEB_SUMMARIZER_PROVIDER=ollama
export MCP_WEB_SUMMARIZER_MODEL=llama3.2:3b

# Or use task commands
task llm:ollama:pull # Pull recommended models
task llm:ollama:start # Start Ollama server

See the project documentation for complete local LLM setup.

Other Settings

# Cache settings
export MCP_WEB_CACHE_DIR="$HOME/.cache/mcp-web" # $HOME, because the shell does not expand ~ inside quotes
export MCP_WEB_CACHE_TTL=604800 # 7 days

# Fetcher settings
export MCP_WEB_FETCHER_TIMEOUT=30
export MCP_WEB_FETCHER_MAX_CONCURRENT=5

# Summarizer settings
export MCP_WEB_SUMMARIZER_TEMPERATURE=0.3
export MCP_WEB_SUMMARIZER_MAX_TOKENS=2048
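
The variables above follow an `MCP_WEB_<SECTION>_<NAME>` pattern. As a rough illustration of how such settings could be read with fallbacks to the documented defaults, here is a minimal sketch. The `Settings` dataclass and `settings_from_env` function are hypothetical names chosen for this example; they are not the project's actual `config.py` API.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; field names are illustrative only."""

    cache_ttl: int = 604800  # seconds (7 days)
    fetcher_timeout: int = 30  # seconds
    fetcher_max_concurrent: int = 5
    summarizer_temperature: float = 0.3
    summarizer_max_tokens: int = 2048


def settings_from_env(env=None) -> Settings:
    """Read MCP_WEB_* variables, falling back to the defaults above."""
    env = os.environ if env is None else env
    values = {}
    for name, default in vars(Settings()).items():
        raw = env.get(f"MCP_WEB_{name.upper()}")
        # Convert the string to the type of the default (int or float).
        values[name] = type(default)(raw) if raw is not None else default
    return Settings(**values)
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `settings_from_env({"MCP_WEB_FETCHER_TIMEOUT": "10"})`.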

Usage

As an MCP Server

Add to your MCP client configuration (e.g., Claude Desktop):

{
  "mcpServers": {
    "mcp-web": {
      "command": "python",
      "args": ["-m", "mcp_web.mcp_server"]
    }
  }
}

Programmatic Usage

import asyncio

from mcp_web import create_server, load_config
from mcp_web.mcp_server import WebSummarizationPipeline

# Create server
config = load_config()
mcp = create_server(config)


async def main() -> None:
    # Use the pipeline directly
    pipeline = WebSummarizationPipeline(config)

    # Summarize URLs, printing each chunk as it is streamed
    async for chunk in pipeline.summarize_urls(
        urls=["https://example.com"],
        query="What are the key features?",
    ):
        print(chunk, end="")


# "async for" is only valid inside a coroutine, so run it in an event loop
asyncio.run(main())
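
Because the pipeline streams its output, callers that need the whole summary as one string have to collect the chunks themselves. The sketch below shows one way to do that. `fake_summary_stream` is a stand-in written for this example so the code runs without a configured LLM; in real use the stream would come from `pipeline.summarize_urls(...)`.

```python
import asyncio
from typing import AsyncIterator


async def fake_summary_stream() -> AsyncIterator[str]:
    """Stand-in for pipeline.summarize_urls(...); yields text chunks."""
    for piece in ["# Summary\n", "- point one\n", "- point two\n"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield piece


async def collect(stream: AsyncIterator[str]) -> str:
    """Join streamed chunks into a single Markdown string."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


summary = asyncio.run(collect(fake_summary_stream()))
```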

Tools

`summarize_urls`

Summarize content from one or more URLs with optional query focus.

Parameters:

urls (List[str], required): URLs to summarize
query (str, optional): Question or topic to focus the summary on
follow_links (bool, default=False): Follow relevant outbound links
max_depth (int, default=1): Maximum link following depth

Example:

result = await summarize_urls(
    urls=["https://docs.python.org/3/library/asyncio.html"],
    query="How do I create async tasks?",
    follow_links=True,
)
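
To make the interaction between `follow_links` and `max_depth` concrete, the following sketch performs a depth-limited, breadth-first traversal over a toy link graph. It assumes that depth 0 means "only the URLs that were passed in"; how mcp-web itself counts depth and decides which links are relevant is not shown here. `crawl_order`, `get_links`, and `LINKS` are hypothetical names for this example.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_order(
    start_urls: Iterable[str],
    get_links: Callable[[str], list[str]],
    max_depth: int = 1,
) -> list[str]:
    """Breadth-first traversal that stops max_depth hops from the start URLs."""
    seen = set()
    order = []
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue  # never visit the same page twice
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in get_links(url))
    return order


# Toy link graph standing in for real pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
```

With this graph, `crawl_order(["a"], lambda u: LINKS[u], max_depth=1)` visits `a`, `b`, and `c` but not `d`, which is two hops away.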

`get_cache_stats`

Get cache and metrics statistics.

Returns: Dictionary with cache size, hit rates, and processing metrics

`clear_cache`

Clear the entire cache.

`prune_cache`

Remove expired cache entries.
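
The difference between expiry (TTL), eviction (LRU), and pruning can be illustrated with a small in-memory sketch. This is not the project's cache: the real one is disk-based and bounded by size rather than entry count, and the class name `TTLLRUCache` and its methods are hypothetical.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """In-memory illustration of TTL expiry plus LRU eviction."""

    def __init__(self, max_entries: int, ttl: float, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value), least recent first

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]  # expired entries count as misses
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def prune(self) -> int:
        """Drop every expired entry and return how many were removed."""
        now = self.clock()
        expired = [k for k, (exp, _) in self._data.items() if now >= exp]
        for key in expired:
            del self._data[key]
        return len(expired)
```

Pruning only removes entries whose TTL has passed, whereas clearing would drop everything regardless of age.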

Architecture

Pipeline Flow

URLs → Fetch (httpx/Playwright) → Extract (trafilatura) →
Chunk (hierarchical/semantic) → Summarize (LLM map-reduce) →
Markdown Output (streaming)
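
The "Summarize (LLM map-reduce)" stage can be sketched as follows: summarize every chunk independently, then repeatedly summarize batches of summaries until one remains. This is a generic illustration of the map-reduce idea, not the code in `summarizer.py`. The `summarize` callable stands in for an LLM call, and `first_words` and `batch_size` are assumptions made for this example.

```python
from typing import Callable, Sequence


def first_words(text: str, limit: int = 3) -> str:
    """Toy stand-in for an LLM call: keep the first few words."""
    return " ".join(text.split()[:limit])


def map_reduce_summarize(
    chunks: Sequence[str],
    summarize: Callable[[str], str],
    batch_size: int = 4,
) -> str:
    """Summarize each chunk (map), then merge summaries in batches (reduce)."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each round shrinks")
    if not chunks:
        return ""
    summaries = [summarize(chunk) for chunk in chunks]  # map step
    while len(summaries) > 1:  # reduce step, repeated until one summary remains
        summaries = [
            summarize("\n".join(summaries[i:i + batch_size]))
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]
```

Because each reduce round shrinks the list by a factor of `batch_size`, no single call ever sees more than `batch_size` summaries, which is what lets the approach scale to long documents.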

Key Design Decisions

DD-001: httpx primary, Playwright fallback for robustness
DD-002: Trafilatura with favor_recall=True for maximum content extraction
DD-003: Hierarchical + semantic chunking preserves document structure
DD-004: 512-token chunks with a 50-token overlap, so context carries across chunk boundaries
DD-006: Map-reduce summarization handles arbitrarily long documents
DD-007: 7-day disk cache with LRU eviction
DD-008: OpenAI GPT-4o-mini default (configurable)
DD-009: Streaming output for better UX
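
The chunk size and overlap in DD-004 can be pictured as a sliding window that advances by `chunk_size - overlap` tokens each step. The sketch below shows only that arithmetic on a pre-tokenized list. It ignores the hierarchical and semantic boundaries described in DD-003, and `chunk_tokens` is a hypothetical helper rather than a function from `chunker.py`.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 50
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults the window advances 462 tokens per step, so a 10,000-token document produces 22 chunks, and each chunk begins with the last 50 tokens of the previous one.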

See the docs/ directory for full design documentation.

Project Structure

mcp-web/
├── src/mcp_web/
│   ├── mcp_server.py    # MCP tool entry point & orchestration
│   ├── fetcher.py       # URL fetching (httpx + Playwright)
│   ├── extractor.py     # Content extraction (trafilatura)
│   ├── chunker.py       # Text chunking strategies
│   ├── summarizer.py    # LLM summarization (map-reduce)
│   ├── cache.py         # Disk cache manager
│   ├── metrics.py       # Logging & metrics collection
│   ├── config.py        # Configuration management
│   └── utils.py         # Token counting, formatting
├── tests/               # Unit & integration tests
├── docs/                # Architecture & API documentation
├── examples/            # Example usage scripts
└── pyproject.toml       # Dependencies & project metadata

Development

Using Taskfile (Recommended)

# Show all available tasks
task --list

# Complete setup
task dev:setup

# Run tests
task test # All tests except live
task test:fast # Unit + security + golden
task test:coverage # With coverage report
task test:parallel # Parallel execution

# Code quality
task lint # All linting
task format # Auto-format code
task security # Security scans
task analyze # Complete analysis

# CI simulation
task ci # Full CI pipeline
task ci:fast # Quick check

See the project documentation, or run task --list, for the complete task reference.

Running Tests

# With Taskfile
task test # Recommended
task test:unit
task test:security
task test:golden # With local/cloud LLM

# Or with pytest directly
pytest -m "not live"
pytest -m unit
pytest -m golden

Code Quality

# With Taskfile (recommended)
task lint
task format
task security

# Or directly
ruff check src/ tests/
ruff format src/ tests/
mypy src/
bandit -r src/

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes with tests
Run linting and tests
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Configuration Reference

See the project documentation for complete configuration options.

Key Settings

Setting	Default	Description
`FETCHER_TIMEOUT`	30	HTTP request timeout (seconds)
`FETCHER_MAX_CONCURRENT`	5	Max parallel fetches
`CHUNKER_CHUNK_SIZE`	512	Target tokens per chunk
`CHUNKER_OVERLAP`	50	Overlap between chunks (tokens)
`SUMMARIZER_MODEL`	gpt-4o-mini	LLM model to use
`SUMMARIZER_TEMPERATURE`	0.3	LLM temperature
`CACHE_TTL`	604800	Cache TTL in seconds (7 days)
`CACHE_MAX_SIZE`	1GB	Maximum cache size

Performance

Benchmarks

Single URL: ~5-10 seconds (with cache)
Multiple URLs (5): ~15-30 seconds (parallel fetching)
Large document (10k+ tokens): ~30-60 seconds (map-reduce)
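
Parallel fetching is typically bounded so that a long URL list does not open unlimited connections at once, which is the role a setting like `FETCHER_MAX_CONCURRENT` plays. A minimal sketch of that pattern with `asyncio.Semaphore` follows. `fetch_all` and the `fetch_one` callable are hypothetical names, and this is not taken from `fetcher.py`.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrent: int = 5):
    """Run fetch_one over urls with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:  # wait here while max_concurrent fetches are running
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(url) for url in urls))
```

Here `fetch_one` is any coroutine function that takes a URL, so the same wrapper works whether the underlying fetch uses an HTTP client or a browser.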

Optimization Tips

Enable caching for repeated queries
Adjust chunk_size based on content type
Use gpt-4o-mini for cost-effective summaries
Limit max_depth for link following
Prune cache periodically

Troubleshooting

Common Issues

"No module named 'playwright'"

Run pip install playwright && playwright install chromium

"OPENAI_API_KEY not set"

Export your API key: export OPENAI_API_KEY="sk-..."

"Cache permission denied"

Check the permissions on ~/.cache/mcp-web, or point MCP_WEB_CACHE_DIR at a directory you can write to

"Extraction returned empty content"

The site may be JS-heavy; the fetcher falls back to Playwright automatically
Some sites block scrapers; check robots.txt
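
The primary-then-fallback behaviour can be sketched in a library-agnostic way. In the example below, `primary` would wrap a plain HTTP request and `fallback` a headless-browser render; both are passed in as callables so the sketch runs without either dependency. The trigger used here (an exception, or fewer than `min_length` characters of markup) is an assumption for illustration; the conditions mcp-web actually uses are not documented in this README.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    min_length: int = 200,
) -> tuple[str, str]:
    """Return (html, source), using fallback if primary fails or returns too little."""
    try:
        html = primary(url)
        if len(html.strip()) >= min_length:
            return html, "primary"
    except Exception:
        pass  # treat any primary failure as a signal to fall back
    return fallback(url), "fallback"
```

Returning the source alongside the markup makes it easy to log how often the slower fallback path is taken.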

Debug Mode

Enable debug logging:

export MCP_WEB_METRICS_LOG_LEVEL="DEBUG"
python -m mcp_web.mcp_server

Roadmap

v0.2.0 (Current)

Local LLM support (Ollama, LM Studio, LocalAI)
Comprehensive testing infrastructure
Security testing (OWASP LLM Top 10)
Taskfile for better tooling
Golden tests with deterministic verification

v0.3.0

PDF OCR support for scanned documents
Multi-language translation
Anthropic Claude integration
Vector embeddings for semantic search

v0.4.0

Per-domain extraction rules
Image/diagram extraction
Incremental summarization
Prometheus metrics export

See the project documentation for the full roadmap.

License

MIT License - see the license file in the repository for details.

Acknowledgments

Model Context Protocol by Anthropic
trafilatura for content extraction
httpx for async HTTP
Playwright for browser automation
tiktoken for token counting

Support

📚 Documentation
🐛 Issue Tracker
💬 Discussions