web-search by vishalkg - MCP Server

WebSearch MCP Server

High-performance Model Context Protocol (MCP) server for web search and content extraction with intelligent fallback system.

✨ Features

🚀 Fast: Async implementation with parallel execution
🔍 Multi-Engine: Google, Bing, DuckDuckGo, Startpage, Brave Search
🛡️ Intelligent Fallbacks: Google→Startpage, Bing→DuckDuckGo, Brave (standalone)
📄 Content Extraction: Clean text extraction from web pages
💾 Smart Caching: LRU cache with compression and deduplication
🔑 API Integration: Google Custom Search, Brave Search APIs with quota management
⚡ Resilient: Automatic failover and comprehensive error handling

📦 Installation

Production Use (Recommended)

# Create virtual environment
python -m venv ~/.websearch/venv
source ~/.websearch/venv/bin/activate

# Install from GitHub
pip install git+https://github.com/vishalkg/web-search.git

Development

git clone https://github.com/vishalkg/web-search.git
cd web-search
pip install -e .

⚙️ Configuration

Q CLI

# Add to Q CLI (after installation)
q mcp add --name websearch --command ~/.websearch/venv/bin/websearch-server

# Test
q chat "search for python tutorials"

Claude Desktop

Add to your MCP settings file:

claude mcp add websearch ~/.websearch/venv/bin/websearch-server -s user

🗂️ File Structure (Installation Independent)

The server automatically creates and manages files in a unified user directory:

~/.websearch/                 # Single websearch directory
├── venv/                    # Virtual environment (recommended)
├── config/
│   └── .env                 # Configuration file
├── data/
│   ├── search-metrics.jsonl # Search analytics
│   └── quota/              # API quota tracking
│       ├── google_quota.json
│       └── brave_quota.json
├── logs/
│   └── web-search.log      # Application logs
└── cache/                  # Optional caching

Environment Variable Overrides

WEBSEARCH_HOME: Base directory (default: ~/.websearch)
WEBSEARCH_CONFIG_DIR: Config directory override
WEBSEARCH_LOG_DIR: Log directory override

🔧 Usage

The server provides two main tools with multiple search modes:

Search Web

# Standard 5-engine search (backward compatible)
search_web("quantum computing applications", num_results=10)

# New 3-engine fallback search (optimized)
search_web_fallback("machine learning tutorials", num_results=5)

Search Engines:

Google Custom Search API (with Startpage fallback)
Bing (with DuckDuckGo fallback)
Brave Search API (standalone)
DuckDuckGo (scraping)
Startpage (scraping)

Fetch Page Content

# Extract clean text from URLs
fetch_page_content("https://example.com")
fetch_page_content(["https://site1.com", "https://site2.com"])  # Batch processing

🏗️ Architecture

websearch/
├── core/
│   ├── search.py              # Sync search orchestration
│   ├── async_search.py        # Async search orchestration
│   ├── fallback_search.py     # 3-engine fallback system
│   ├── async_fallback_search.py # Async fallback system
│   ├── ranking.py             # Quality-first result ranking
│   └── common.py              # Shared utilities
├── engines/
│   ├── google_api.py          # Google Custom Search API
│   ├── brave_api.py           # Brave Search API
│   ├── bing.py                # Bing scraping
│   ├── duckduckgo.py          # DuckDuckGo scraping
│   └── startpage.py           # Startpage scraping
├── utils/
│   ├── unified_quota.py       # Unified API quota management
│   ├── deduplication.py       # Result deduplication
│   ├── advanced_cache.py      # Enhanced caching system
│   └── http.py                # HTTP utilities
└── server.py                  # FastMCP server

🔧 Advanced Configuration

Environment Variables

# API Configuration
export GOOGLE_CSE_API_KEY=your_google_api_key
export GOOGLE_CSE_ID=your_google_cse_id
export BRAVE_SEARCH_API_KEY=your_brave_api_key

# Quota Management (Optional)
export GOOGLE_DAILY_QUOTA=100        # Default: 100 requests/day
export BRAVE_MONTHLY_QUOTA=2000      # Default: 2000 requests/month

# Performance Tuning
export WEBSEARCH_CACHE_SIZE=1000
export WEBSEARCH_TIMEOUT=10
export WEBSEARCH_LOG_LEVEL=INFO

How to Get API Keys

Google Custom Search API

API Key: Go to https://developers.google.com/custom-search/v1/introduction and click "Get a Key"
CSE ID: Go to https://cse.google.com/cse/ and follow prompts to create a search engine

Brave Search API

Go to Brave Search API
Sign up for a free account
Go to your dashboard
Copy the API key as BRAVE_API_KEY
Free tier: 2000 requests/month

Quota Management

Unified System: Single quota manager for all APIs
Google: Daily quota (default 100 requests/day)
Brave: Monthly quota (default 2000 requests/month)
Storage: Quota files stored in ~/.websearch/ directory
Auto-reset: Quotas automatically reset at period boundaries
Fallback: Automatic fallback to scraping when quotas exhausted

Search Modes

Standard Mode: Uses all 5 engines for maximum coverage
Fallback Mode: Uses 3 engines with intelligent fallbacks for efficiency
API-First Mode: Prioritizes API calls over scraping when keys available

🐛 Troubleshooting

Issue	Solution
No results	Check internet connection and logs
API quota exhausted	System automatically falls back to scraping
Google API errors	Verify `GOOGLE_CSE_API_KEY` and `GOOGLE_CSE_ID`
Brave API errors	Check `BRAVE_SEARCH_API_KEY` and quota status
Permission denied	`chmod +x start.sh`
Import errors	Ensure Python 3.12+ and dependencies installed
Circular import warnings	Fixed in v2.0+ (10.00/10 pylint score)

Debug Mode

# Enable detailed logging
export WEBSEARCH_LOG_LEVEL=DEBUG
python -m websearch.server

API Status Check

# Test API connectivity
cd debug/
python test_brave_api.py      # Test Brave API
python test_fallback.py       # Test fallback system

📈 Performance & Monitoring

Metrics

Pylint Score: 10.00/10 (perfect code quality)
Search Speed: ~2-3 seconds for 5-engine search
Fallback Speed: ~1-2 seconds for 3-engine search
Cache Hit Rate: ~85% for repeated queries
API Quota Efficiency: Automatic fallback prevents service interruption

Monitoring

Logs are written to web-search.log with structured format:

tail -f web-search.log | grep "search completed"

🔒 Security

No hardcoded secrets: All API keys via environment variables
Clean git history: Secrets scrubbed from all commits
Input validation: Comprehensive sanitization of search queries
Rate limiting: Built-in quota management for API calls
Secure defaults: HTTPS-only requests, timeout protection

🚀 Performance Tips

Use fallback mode for faster searches when you don't need maximum coverage
Set API keys to reduce reliance on scraping (faster + more reliable)
Enable caching for repeated queries (enabled by default)
Tune batch sizes for content extraction based on your needs

🤝 Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Run tests (pytest)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

📄 License

MIT License - see file for details.