Code-Hex/light-research-mcp

LLM Researcher

A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.

Built with TypeScript, tsup, and vitest for a modern development experience.

Features

  • MCP Server Support: Provides Model Context Protocol server for LLM integration
  • Free Operation: Uses DuckDuckGo HTML endpoint (no API costs)
  • GitHub Code Search: Search GitHub repositories for code examples and implementation patterns
  • Smart Content Extraction: Playwright + @mozilla/readability for clean content
  • LLM-Optimized Output: Sanitized Markdown (h1-h3, bold, italic, links only)
  • Rate Limited: Respects DuckDuckGo with 1 req/sec limit
  • Cross-Platform: Works on macOS, Linux, and WSL
  • Multiple Modes: CLI, MCP server, search, direct URL, and interactive modes
  • Type Safe: Full TypeScript implementation with strict typing
  • Modern Tooling: Built with tsup bundler and vitest testing

Installation

Prerequisites

  • Node.js 20.0.0 or higher
  • No local Chrome installation required (uses Playwright's bundled Chromium)

Setup

# Clone or download the project
cd light-research-mcp

# Install dependencies (using pnpm)
pnpm install

# Build the project
pnpm build

# Install Playwright browsers
pnpm install-browsers

# Optional: Link globally for system-wide access
pnpm link --global

Usage

MCP Server Mode

Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:

# Start MCP server (stdio transport)
llmresearcher --mcp

# The server provides these tools to MCP clients:
# - github_code_search: Search GitHub repositories for code
# - duckduckgo_web_search: Search the web with DuckDuckGo
# - extract_content: Extract detailed content from URLs
Setting up with Claude Code
# Add as an MCP server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Or with project scope for team sharing
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# List configured servers
claude mcp list

# Check server status
claude mcp get light-research-mcp
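When added with project scope, Claude Code records the server in a `.mcp.json` file at the repository root. A minimal sketch of what that file looks like (the path is a placeholder for your own checkout location):

```json
{
  "mcpServers": {
    "light-research-mcp": {
      "command": "node",
      "args": ["/path/to/light-research-mcp/dist/bin/llmresearcher.js", "--mcp"]
    }
  }
}
```

Committing this file lets teammates pick up the server configuration automatically.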
MCP Tool Usage Examples

Once configured, you can use these tools in Claude:

> Search for React hooks examples on GitHub
Tool: github_code_search
Query: "useState useEffect hooks language:javascript"

> Search for TypeScript best practices
Tool: duckduckgo_web_search  
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)

> Extract content from a search result
Tool: extract_content
URL: https://example.com/article-from-search-results

Command Line Interface

# Search mode - Search DuckDuckGo and interactively browse results
llmresearcher "machine learning transformers"

# GitHub Code Search mode - Search GitHub for code
llmresearcher -g "useState hooks language:typescript"

# Direct URL mode - Extract content from specific URL
llmresearcher -u https://example.com/article

# Interactive mode - Enter interactive search session
llmresearcher

# Verbose logging - See detailed operation logs
llmresearcher -v "search query"

# MCP Server mode - Start as Model Context Protocol server
llmresearcher --mcp

Development

Scripts

# Build the project
pnpm build

# Build in watch mode (for development)
pnpm dev

# Run tests
pnpm test

# Run tests in CI mode (single run)
pnpm test:run

# Type checking
pnpm type-check

# Clean build artifacts
pnpm clean

# Install Playwright browsers
pnpm install-browsers

Interactive Commands

When in search results view:

  • 1-10: Select a result by number
  • b or back: Return to search results
  • open <n>: Open result #n in external browser
  • q or quit: Exit the program

When viewing content:

  • b or back: Return to search results
  • /<term>: Search for term within the extracted content
  • open: Open current page in external browser
  • q or quit: Exit the program
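The interactive commands above can be modeled as a small parser. This is a hypothetical sketch of the command grammar, not the actual implementation in src/cli.ts:

```typescript
// Hypothetical model of the interactive command grammar described above.
type Command =
  | { kind: "select"; index: number } // 1-10: select a result
  | { kind: "back" }                  // b / back
  | { kind: "open"; index?: number }  // open, or open <n>
  | { kind: "find"; term: string }    // /<term>: search within content
  | { kind: "quit" };                 // q / quit

function parseCommand(input: string): Command | undefined {
  const s = input.trim();
  if (/^(q|quit)$/i.test(s)) return { kind: "quit" };
  if (/^(b|back)$/i.test(s)) return { kind: "back" };
  if (s.startsWith("/")) return { kind: "find", term: s.slice(1) };
  const open = s.match(/^open(?:\s+(\d+))?$/i);
  if (open) return { kind: "open", index: open[1] ? Number(open[1]) : undefined };
  const n = Number(s);
  if (Number.isInteger(n) && n >= 1 && n <= 10) return { kind: "select", index: n };
  return undefined; // Unrecognized input
}
```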

Configuration

Environment Variables

Create a .env file in the project root:

USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10

Configuration File

Create ~/.llmresearcherrc in your home directory:

{
  "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  "timeout": 30000,
  "maxRetries": 3,
  "rateLimitDelay": 1000,
  "cacheEnabled": true,
  "maxResults": 10
}

Configuration Options

Option         | Default                                     | Description
---------------|---------------------------------------------|-------------------------------------------
userAgent      | Mozilla/5.0 (compatible; LLMResearcher/1.0) | User agent for HTTP requests
timeout        | 30000                                       | Request timeout in milliseconds
maxRetries     | 3                                           | Maximum retry attempts for failed requests
rateLimitDelay | 1000                                        | Delay between requests in milliseconds
cacheEnabled   | true                                        | Enable/disable local caching
maxResults     | 10                                          | Maximum search results to display
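The layering of defaults, RC file, and environment variables can be sketched as follows. This is an illustrative reconstruction, not the actual code in src/config.ts; the merge order (defaults, then RC file, then environment) is the assumption:

```typescript
// Sketch of layered config loading: defaults <- ~/.llmresearcherrc <- .env variables.
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

const defaults: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

// Environment variables win over the RC file; string values are coerced to the right types.
function loadConfig(
  env: Record<string, string | undefined>,
  rcFile: Partial<ResearcherConfig> = {}
): ResearcherConfig {
  const merged = { ...defaults, ...rcFile };
  if (env.USER_AGENT) merged.userAgent = env.USER_AGENT;
  if (env.TIMEOUT) merged.timeout = Number(env.TIMEOUT);
  if (env.MAX_RETRIES) merged.maxRetries = Number(env.MAX_RETRIES);
  if (env.RATE_LIMIT_DELAY) merged.rateLimitDelay = Number(env.RATE_LIMIT_DELAY);
  if (env.CACHE_ENABLED) merged.cacheEnabled = env.CACHE_ENABLED === "true";
  if (env.MAX_RESULTS) merged.maxResults = Number(env.MAX_RESULTS);
  return merged;
}
```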

Architecture

Core Components

  1. MCPResearchServer (src/mcp-server.ts)

    • Model Context Protocol server implementation
    • Three main tools: github_code_search, duckduckgo_web_search, extract_content
    • JSON-based responses for LLM consumption
  2. DuckDuckGoSearcher (src/search.ts)

    • HTML scraping of DuckDuckGo search results with locale support
    • URL decoding for /l/?uddg= format links
    • Rate limiting and retry logic
  3. GitHubCodeSearcher (src/github-code-search.ts)

    • GitHub Code Search API integration via gh CLI
    • Advanced query support with language, repo, and file filters
    • Authentication and rate limiting
  4. ContentExtractor (src/extractor.ts)

    • Playwright-based page rendering with resource blocking
    • @mozilla/readability for main content extraction
    • DOMPurify sanitization and Markdown conversion
  5. CLIInterface (src/cli.ts)

    • Interactive command-line interface
    • Search result navigation
    • Content viewing and text search
  6. Configuration (src/config.ts)

    • Environment and RC file configuration loading
    • Verbose logging support
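The `/l/?uddg=` URL decoding mentioned for DuckDuckGoSearcher can be illustrated with a small helper. This is a sketch of the general technique, not necessarily the exact code in src/search.ts:

```typescript
// Sketch: unwrap DuckDuckGo redirect links of the form "/l/?uddg=<encoded-url>&rut=...".
function decodeDuckDuckGoUrl(href: string): string {
  // Relative result links are resolved against the DuckDuckGo origin.
  const url = new URL(href, "https://duckduckgo.com");
  if (url.pathname === "/l/") {
    // searchParams.get already percent-decodes the value.
    const target = url.searchParams.get("uddg");
    if (target) return target;
  }
  return href; // Already a direct link.
}
```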

Content Processing Pipeline

MCP Server Mode
  1. Search:
    • DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination
    • GitHub: Code Search API → Format results → JSON response with code snippets
  2. Extract: URL from search results → Playwright navigation → Content extraction
  3. Process: @mozilla/readability → DOMPurify sanitization → Clean JSON output
  4. Output: Structured JSON for LLM consumption
CLI Mode
  1. Search: DuckDuckGo HTML endpoint → Parse results → Display numbered list
  2. Extract: Playwright navigation → Resource blocking → JS rendering
  3. Process: @mozilla/readability → DOMPurify sanitization → Turndown Markdown
  4. Output: Clean Markdown with h1-h3, bold, italic, and links only

Security Features

  • Resource Blocking: Prevents loading of images, CSS, fonts for speed and security
  • Content Sanitization: DOMPurify removes scripts, iframes, and dangerous elements
  • Limited Markdown: Only allows safe formatting elements (h1-h3, strong, em, a)
  • Rate Limiting: Respects DuckDuckGo's rate limits with exponential backoff
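The resource-blocking decision reduces to a simple predicate over Playwright resource types. A minimal sketch (the actual wiring lives in src/extractor.ts; the exact set of blocked types is an assumption):

```typescript
// Sketch of the resource-blocking predicate used during page rendering.
const BLOCKED_RESOURCE_TYPES = new Set(["image", "stylesheet", "font", "media"]);

function shouldBlockResource(resourceType: string): boolean {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// With Playwright, such a predicate would typically be applied via page.route, roughly:
//   await page.route("**/*", (route) =>
//     shouldBlockResource(route.request().resourceType()) ? route.abort() : route.continue()
//   );
```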

Examples

MCP Server Usage with Claude Code

1. GitHub Code Search
You: "Find React hook examples for state management"

Claude uses github_code_search tool:
{
  "query": "useState useReducer state management language:javascript",
  "results": [
    {
      "title": "facebook/react/packages/react/src/ReactHooks.js",
      "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
      "snippet": "function useState(initialState) {\n  return dispatcher.useState(initialState);\n}"
    }
  ],
  "pagination": {
    "currentPage": 1,
    "hasNextPage": true,
    "nextPageToken": "2"
  }
}
2. Web Search with Locale
You: "Search for Vue.js tutorials in Japanese"

Claude uses duckduckgo_web_search tool:
{
  "query": "Vue.js チュートリアル 入門",
  "locale": "jp-jp",
  "results": [
    {
      "title": "Vue.js入門ガイド",
      "url": "https://example.com/vue-tutorial",
      "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
    }
  ]
}
3. Content Extraction
You: "Extract the full content from that Vue.js tutorial"

Claude uses extract_content tool:
{
  "url": "https://example.com/vue-tutorial",
  "title": "Vue.js入門ガイド",
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
}

CLI Examples

Basic Search
$ llmresearcher "python web scraping"

🔍 Search Results:
══════════════════════════════════════════════════

1. Python Web Scraping Tutorial
   URL: https://realpython.com/python-web-scraping-practical-introduction/
   Complete guide to web scraping with Python using requests and Beautiful Soup...

2. Web Scraping with Python - BeautifulSoup and requests
   URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
   Learn how to scrape websites with Python, Beautiful Soup, and requests...

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser

> 1

📥 Extracting content from: Python Web Scraping Tutorial

📄 Content:
══════════════════════════════════════════════════

**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z

──────────────────────────────────────────────────

# Python Web Scraping: A Practical Introduction

Web scraping is the process of collecting and parsing raw data from the web...

## What Is Web Scraping?

Web scraping is a technique to automatically access and extract large amounts...

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser

> /beautiful soup

🔍 Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')

Direct URL Mode

$ llmresearcher -u https://docs.python.org/3/tutorial/

📄 Content:
══════════════════════════════════════════════════

**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z

──────────────────────────────────────────────────

# The Python Tutorial

Python is an easy to learn, powerful programming language...

## An Informal Introduction to Python

In the following examples, input and output are distinguished...

Verbose Mode

$ llmresearcher -v "nodejs tutorial"

[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...

Testing

Running Tests

# Run tests in watch mode
pnpm test

# Run tests once (CI mode)
pnpm test:run

# Run tests with coverage
pnpm test -- --coverage

Test Coverage

The test suite includes:

  • Unit Tests: Individual component testing

    • search.test.ts: DuckDuckGo search functionality, URL decoding, rate limiting
    • extractor.test.ts: Content extraction, Markdown conversion, resource management
    • config.test.ts: Configuration validation and environment handling
  • Integration Tests: End-to-end workflow testing

    • integration.test.ts: Complete search-to-extraction workflows, error handling, cleanup

Test Features

  • Fast: Powered by vitest for quick feedback
  • Type-safe: Full TypeScript support in tests
  • Isolated: Each test cleans up its resources
  • Comprehensive: Covers search, extraction, configuration, and integration scenarios

Troubleshooting

Common Issues

"Browser not found" Error

pnpm install-browsers

Rate Limiting Issues

  • The tool automatically handles rate limiting with 1-second delays
  • If you encounter 429 errors, the tool will automatically retry with exponential backoff
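The retry-with-backoff behavior described above can be sketched as a small helper. This is an illustrative reconstruction, not the actual retry code in src/search.ts; the `baseDelayMs` parameter is added here so the behavior is easy to demonstrate:

```typescript
// Sketch of a 1 req/sec limiter plus retry with exponential backoff on failures such as 429s.
class RateLimiter {
  private last = 0;
  constructor(private delayMs = 1000) {}

  // Resolves once at least delayMs has passed since the previous request.
  async wait(): Promise<void> {
    const elapsed = Date.now() - this.last;
    if (elapsed < this.delayMs) {
      await new Promise((resolve) => setTimeout(resolve, this.delayMs - elapsed));
    }
    this.last = Date.now();
  }
}

async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```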

Content Extraction Failures

  • Some sites may block automated access
  • The tool includes fallback extraction methods (main → body content)
  • Use verbose mode (-v) to see detailed error information

Permission Denied (Unix/Linux)

chmod +x dist/bin/llmresearcher.js

Performance Optimization

The tool is optimized for speed:

  • Resource Blocking: Automatically blocks images, CSS, fonts
  • Network Idle: Waits for JavaScript to complete rendering
  • Content Caching: Supports local caching to avoid repeated requests
  • Minimal Dependencies: Uses lightweight, focused libraries
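The content-caching idea can be sketched with a simple in-memory TTL map. This is hypothetical (the roadmap mentions SQLite-backed caching; this version only illustrates the TTL mechanics):

```typescript
// Hypothetical sketch of a URL -> Markdown cache with a time-to-live.
class ContentCache {
  private store = new Map<string, { value: string; expires: number }>();
  constructor(private ttlMs = 24 * 60 * 60 * 1000) {} // Default: 24 hours

  get(url: string): string | undefined {
    const entry = this.store.get(url);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.store.delete(url); // Evict stale entries lazily on read.
      return undefined;
    }
    return entry.value;
  }

  set(url: string, markdown: string): void {
    this.store.set(url, { value: markdown, expires: Date.now() + this.ttlMs });
  }
}
```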

Development

Project Structure

light-research-mcp/
├── dist/                      # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js   # CLI entry point (executable)
│   └── *.js                   # Compiled TypeScript modules
├── src/                       # TypeScript source files
│   ├── bin.ts                 # CLI entry point
│   ├── index.ts               # Main LLMResearcher class
│   ├── mcp-server.ts          # MCP server implementation
│   ├── search.ts              # DuckDuckGo search implementation
│   ├── github-code-search.ts  # GitHub Code Search implementation
│   ├── extractor.ts           # Content extraction with Playwright
│   ├── cli.ts                 # Interactive CLI interface
│   ├── config.ts              # Configuration management
│   └── types.ts               # TypeScript type definitions
├── test/                      # Test files (vitest)
│   ├── search.test.ts         # Search functionality tests
│   ├── extractor.test.ts      # Content extraction tests
│   ├── config.test.ts         # Configuration tests
│   ├── mcp-locale.test.ts     # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts # MCP content extractor tests
│   └── integration.test.ts    # End-to-end integration tests
├── tsconfig.json              # TypeScript configuration
├── tsup.config.ts             # Build configuration
├── vitest.config.ts           # Test configuration
├── package.json
└── README.md

Dependencies

Runtime Dependencies
  • @modelcontextprotocol/sdk: Model Context Protocol server implementation
  • @mozilla/readability: Content extraction from HTML
  • cheerio: HTML parsing for search results
  • commander: CLI argument parsing
  • dompurify: HTML sanitization
  • dotenv: Environment variable loading
  • jsdom: DOM manipulation for server-side processing
  • playwright: Browser automation for JS rendering
  • turndown: HTML to Markdown conversion
Development Dependencies
  • typescript: TypeScript compiler
  • tsup: Fast TypeScript bundler
  • vitest: Fast unit test framework
  • @types/*: TypeScript type definitions

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Roadmap

Planned Features

  • Enhanced MCP Tools: Additional specialized search tools for documentation, APIs, etc.
  • Caching Layer: SQLite-based URL → Markdown caching with 24-hour TTL
  • Search Engine Abstraction: Support for Brave Search, Bing, and other engines
  • Content Summarization: Optional AI-powered content summarization
  • Export Formats: JSON, plain text, and other output formats
  • Batch Processing: Process multiple URLs from file input
  • SSE Transport: Support for Server-Sent Events MCP transport

Performance Improvements

  • Parallel Processing: Concurrent content extraction for multiple results
  • Smart Caching: Intelligent cache invalidation based on content freshness
  • Memory Optimization: Streaming content processing for large documents