Code-Hex/light-research-mcp
LLM Researcher
A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.
Built with TypeScript, tsup, and vitest for a modern development experience.
Features
- MCP Server Support: Provides Model Context Protocol server for LLM integration
- Free Operation: Uses DuckDuckGo HTML endpoint (no API costs)
- GitHub Code Search: Search GitHub repositories for code examples and implementation patterns
- Smart Content Extraction: Playwright + @mozilla/readability for clean content
- LLM-Optimized Output: Sanitized Markdown (h1-h3, bold, italic, links only)
- Rate Limited: Limits DuckDuckGo requests to 1 per second
- Cross-Platform: Works on macOS, Linux, and WSL
- Multiple Modes: CLI, MCP server, search, direct URL, and interactive modes
- Type Safe: Full TypeScript implementation with strict typing
- Modern Tooling: Built with tsup bundler and vitest testing
Installation
Prerequisites
- Node.js 20.0.0 or higher
- No local Chrome installation required (uses Playwright's bundled Chromium)
Setup
# Clone or download the project
cd light-research-mcp
# Install dependencies (using pnpm)
pnpm install
# Build the project
pnpm build
# Install Playwright browsers
pnpm install-browsers
# Optional: Link globally for system-wide access
pnpm link --global
Usage
MCP Server Mode
Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:
# Start MCP server (stdio transport)
llmresearcher --mcp
# The server provides these tools to MCP clients:
# - github_code_search: Search GitHub repositories for code
# - duckduckgo_web_search: Search the web with DuckDuckGo
# - extract_content: Extract detailed content from URLs
Setting up with Claude Code
# Add as an MCP server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp
# Or with project scope for team sharing
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp
# List configured servers
claude mcp list
# Check server status
claude mcp get light-research-mcp
MCP Tool Usage Examples
Once configured, you can use these tools in Claude:
> Search for React hooks examples on GitHub
Tool: github_code_search
Query: "useState useEffect hooks language:javascript"
> Search for TypeScript best practices
Tool: duckduckgo_web_search
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)
> Extract content from a search result
Tool: extract_content
URL: https://example.com/article-from-search-results
Command Line Interface
# Search mode - Search DuckDuckGo and interactively browse results
llmresearcher "machine learning transformers"
# GitHub Code Search mode - Search GitHub for code
llmresearcher -g "useState hooks language:typescript"
# Direct URL mode - Extract content from specific URL
llmresearcher -u https://example.com/article
# Interactive mode - Enter interactive search session
llmresearcher
# Verbose logging - See detailed operation logs
llmresearcher -v "search query"
# MCP Server mode - Start as Model Context Protocol server
llmresearcher --mcp
Development
Scripts
# Build the project
pnpm build
# Build in watch mode (for development)
pnpm dev
# Run tests
pnpm test
# Run tests in CI mode (single run)
pnpm test:run
# Type checking
pnpm type-check
# Clean build artifacts
pnpm clean
# Install Playwright browsers
pnpm install-browsers
Interactive Commands
When in search results view:
- 1-10: Select a result by number
- b or back: Return to search results
- open <n>: Open result #n in external browser
- q or quit: Exit the program
When viewing content:
- b or back: Return to search results
- /<term>: Search for term within the extracted content
- open: Open current page in external browser
- q or quit: Exit the program
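The commands above could be parsed roughly as follows. This is a minimal sketch, not the code in src/cli.ts: `parseCommand`, the `Command` union, and the `view` parameter are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch: names and types are illustrative, not taken from src/cli.ts.
type View = "results" | "content";

type Command =
  | { kind: "select"; index: number } // 1-10 in the results view
  | { kind: "back" }
  | { kind: "open"; index?: number } // "open <n>" in results, bare "open" in content
  | { kind: "find"; term: string } // "/<term>" in the content view
  | { kind: "quit" }
  | { kind: "unknown"; input: string };

function parseCommand(raw: string, view: View): Command {
  const input = raw.trim();
  const lower = input.toLowerCase();

  // Commands shared by both views.
  if (lower === "q" || lower === "quit") return { kind: "quit" };
  if (lower === "b" || lower === "back") return { kind: "back" };

  if (view === "results") {
    // A bare number between 1 and 10 selects a result.
    if (/^\d+$/.test(input)) {
      const index = Number(input);
      if (index >= 1 && index <= 10) return { kind: "select", index };
    }
    // "open <n>" opens result n in the external browser.
    const open = /^open\s+(\d+)$/.exec(lower);
    if (open) return { kind: "open", index: Number(open[1]) };
  } else {
    // Bare "open" opens the page currently being viewed.
    if (lower === "open") return { kind: "open" };
    // "/<term>" searches inside the extracted content.
    if (input.startsWith("/") && input.length > 1) {
      return { kind: "find", term: input.slice(1) };
    }
  }
  return { kind: "unknown", input };
}
```

Returning a tagged union keeps the parsing step separate from whatever the interface does with each command, so the two can be tested independently.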
Configuration
Environment Variables
Create a .env file in the project root:
USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10
Configuration File
Create ~/.llmresearcherrc in your home directory:
{
"userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
"timeout": 30000,
"maxRetries": 3,
"rateLimitDelay": 1000,
"cacheEnabled": true,
"maxResults": 10
}
Configuration Options
| Option | Default | Description |
|---|---|---|
| userAgent | Mozilla/5.0 (compatible; LLMResearcher/1.0) | User agent for HTTP requests |
| timeout | 30000 | Request timeout in milliseconds |
| maxRetries | 3 | Maximum retry attempts for failed requests |
| rateLimitDelay | 1000 | Delay between requests in milliseconds |
| cacheEnabled | true | Enable/disable local caching |
| maxResults | 10 | Maximum search results to display |
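One way the three configuration sources could be combined is sketched below. The option names, environment variable names, and defaults come from the tables above. The precedence (environment over RC file over defaults) is an assumption, and `resolveConfig`, `intFrom`, and `boolFrom` are hypothetical helpers, not functions from src/config.ts.

```typescript
// Sketch only: the precedence (env > rc file > defaults) is an assumption.
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

const DEFAULTS: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

// Parse an integer, treating missing or non-numeric input as "not set".
function intFrom(value: string | undefined): number | undefined {
  if (value === undefined) return undefined;
  const n = Number.parseInt(value, 10);
  return Number.isNaN(n) ? undefined : n;
}

// Only the literal strings "true" and "false" are recognised.
function boolFrom(value: string | undefined): boolean | undefined {
  if (value === "true") return true;
  if (value === "false") return false;
  return undefined;
}

function resolveConfig(
  env: Record<string, string | undefined>,
  rc: Partial<ResearcherConfig> = {},
): ResearcherConfig {
  return {
    userAgent: env["USER_AGENT"] ?? rc.userAgent ?? DEFAULTS.userAgent,
    timeout: intFrom(env["TIMEOUT"]) ?? rc.timeout ?? DEFAULTS.timeout,
    maxRetries: intFrom(env["MAX_RETRIES"]) ?? rc.maxRetries ?? DEFAULTS.maxRetries,
    rateLimitDelay:
      intFrom(env["RATE_LIMIT_DELAY"]) ?? rc.rateLimitDelay ?? DEFAULTS.rateLimitDelay,
    cacheEnabled: boolFrom(env["CACHE_ENABLED"]) ?? rc.cacheEnabled ?? DEFAULTS.cacheEnabled,
    maxResults: intFrom(env["MAX_RESULTS"]) ?? rc.maxResults ?? DEFAULTS.maxResults,
  };
}
```

In a real process the first argument would be `process.env` and the second the parsed contents of ~/.llmresearcherrc. Passing both in explicitly keeps the function pure and easy to test.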
Architecture
Core Components
- MCPResearchServer (src/mcp-server.ts)
  - Model Context Protocol server implementation
  - Three main tools: github_code_search, duckduckgo_web_search, extract_content
  - JSON-based responses for LLM consumption
- DuckDuckGoSearcher (src/search.ts)
  - HTML scraping of DuckDuckGo search results with locale support
  - URL decoding for /l/?uddg= format links
  - Rate limiting and retry logic
- GitHubCodeSearcher (src/github-code-search.ts)
  - GitHub Code Search API integration via gh CLI
  - Advanced query support with language, repo, and file filters
  - Authentication and rate limiting
- ContentExtractor (src/extractor.ts)
  - Playwright-based page rendering with resource blocking
  - @mozilla/readability for main content extraction
  - DOMPurify sanitization and Markdown conversion
- CLIInterface (src/cli.ts)
  - Interactive command-line interface
  - Search result navigation
  - Content viewing and text search
- Configuration (src/config.ts)
  - Environment and RC file configuration loading
  - Verbose logging support
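The URL decoding step for /l/?uddg= links listed under DuckDuckGoSearcher can be illustrated with the standard URL API. This is a sketch under the assumption that result links may be relative or protocol-relative redirect URLs that carry the target in the `uddg` query parameter. `decodeDuckDuckGoUrl` is a hypothetical name, not the function in src/search.ts.

```typescript
// Sketch: unwraps DuckDuckGo redirect links; not the project's actual implementation.
function decodeDuckDuckGoUrl(href: string): string {
  try {
    // Result links may be relative or protocol-relative
    // ("//duckduckgo.com/l/?uddg=..."), so resolve against a base before parsing.
    const url = new URL(href, "https://duckduckgo.com");
    if (url.pathname === "/l/") {
      // searchParams.get() returns the value already percent-decoded.
      const target = url.searchParams.get("uddg");
      if (target) return target;
    }
    // Not a redirect link: leave it untouched.
    return href;
  } catch {
    // Unparseable input is returned unchanged rather than dropped.
    return href;
  }
}
```

For example, `decodeDuckDuckGoUrl("//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2F")` yields `https://example.com/`.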
Content Processing Pipeline
MCP Server Mode
- Search:
  - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination
  - GitHub: Code Search API → Format results → JSON response with code snippets
- Extract: URL from search results → Playwright navigation → Content extraction
- Process: @mozilla/readability → DOMPurify sanitization → Clean JSON output
- Output: Structured JSON for LLM consumption
CLI Mode
- Search: DuckDuckGo HTML endpoint → Parse results → Display numbered list
- Extract: Playwright navigation → Resource blocking → JS rendering
- Process: @mozilla/readability → DOMPurify sanitization → Turndown Markdown
- Output: Clean Markdown with h1-h3, bold, italic, and links only
Security Features
- Resource Blocking: Prevents loading of images, CSS, fonts for speed and security
- Content Sanitization: DOMPurify removes scripts, iframes, and dangerous elements
- Limited Markdown: Only allows safe formatting elements (h1-h3, strong, em, a)
- Rate Limiting: Respects DuckDuckGo's rate limits with exponential backoff
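The resource blocking rule can be expressed as a small predicate over Playwright's resource types. The sketch below assumes the blocked set is exactly the three kinds named above (images, CSS, fonts). `shouldBlock` and `BLOCKED_RESOURCE_TYPES` are hypothetical names, and the Playwright wiring is shown only as a comment so the example runs without a browser.

```typescript
// Resource kinds the list above names as blocked: images, CSS, fonts.
// The strings match values Playwright reports from request.resourceType().
const BLOCKED_RESOURCE_TYPES: ReadonlySet<string> = new Set(["image", "stylesheet", "font"]);

// Returns true when a request of this type should be aborted.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// With Playwright, a predicate like this is typically attached via page.route:
//
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().resourceType()) ? route.abort() : route.continue(),
//   );
```

Documents and scripts stay allowed in this sketch, because the pipeline described above relies on JavaScript rendering before Readability runs.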
Examples
MCP Server Usage with Claude Code
1. GitHub Code Search
You: "Find React hook examples for state management"
Claude uses github_code_search tool:
{
"query": "useState useReducer state management language:javascript",
"results": [
{
"title": "facebook/react/packages/react/src/ReactHooks.js",
"url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
"snippet": "function useState(initialState) {\n return dispatcher.useState(initialState);\n}"
}
],
"pagination": {
"currentPage": 1,
"hasNextPage": true,
"nextPageToken": "2"
}
}
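A client consuming this response might model it with types like the ones below. The shapes are inferred only from the sample JSON above and may not match the definitions in src/types.ts. `SearchResponse`, `Pagination`, and `nextPageToken` are hypothetical names.

```typescript
// Shapes inferred from the sample response above; assumptions, not the project's types.
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

interface Pagination {
  currentPage: number;
  hasNextPage: boolean;
  nextPageToken?: string;
}

interface SearchResponse {
  query: string;
  results: SearchResult[];
  pagination?: Pagination;
}

// Returns the token for the next page, or undefined when there is none.
function nextPageToken(response: SearchResponse): string | undefined {
  const pagination = response.pagination;
  return pagination?.hasNextPage ? pagination.nextPageToken : undefined;
}
```

Checking `hasNextPage` before reading the token guards against a response that still carries a stale token on its last page.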
2. Web Search with Locale
You: "Search for Vue.js tutorials in Japanese"
Claude uses duckduckgo_web_search tool:
{
"query": "Vue.js チュートリアル 入門",
"locale": "jp-jp",
"results": [
{
"title": "Vue.jsๅ
ฅ้ใฌใคใ",
"url": "https://example.com/vue-tutorial",
"snippet": "Vue.jsใฎๅบๆฌ็ใชไฝฟใๆนใๅญฆใถใใฅใผใใชใขใซ..."
}
]
}
3. Content Extraction
You: "Extract the full content from that Vue.js tutorial"
Claude uses extract_content tool:
{
"url": "https://example.com/vue-tutorial",
"title": "Vue.jsๅ
ฅ้ใฌใคใ",
"extractedAt": "2024-01-15T10:30:00.000Z",
"content": "# Vue.jsๅ
ฅ้ใฌใคใ\n\nVue.jsใฏ...\n\n## ใคใณในใใผใซ\n\n..."
}
CLI Examples
Basic Search
$ llmresearcher "python web scraping"
Search Results:
──────────────────────────────────────────────────
1. Python Web Scraping Tutorial
URL: https://realpython.com/python-web-scraping-practical-introduction/
Complete guide to web scraping with Python using requests and Beautiful Soup...
2. Web Scraping with Python - BeautifulSoup and requests
URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
Learn how to scrape websites with Python, Beautiful Soup, and requests...
──────────────────────────────────────────────────
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser
> 1
📥 Extracting content from: Python Web Scraping Tutorial
Content:
──────────────────────────────────────────────────
**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z
──────────────────────────────────────────────────
# Python Web Scraping: A Practical Introduction
Web scraping is the process of collecting and parsing raw data from the web...
## What Is Web Scraping?
Web scraping is a technique to automatically access and extract large amounts...
──────────────────────────────────────────────────
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser
> /beautiful soup
Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')
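The in-text search shown above can be approximated with a line-by-line scan. In the sample, the term "beautiful soup" also matches lines containing `BeautifulSoup`, which suggests matching that ignores case and whitespace. The sketch below assumes that reading; the real matching rule in src/cli.ts may differ, and `findInContent` and `Match` are hypothetical names.

```typescript
interface Match {
  line: number; // 1-based line number within the extracted content
  text: string; // the original, unmodified line
}

// Assumption: comparison ignores case and all whitespace.
const normalize = (s: string): string => s.toLowerCase().replace(/\s+/g, "");

function findInContent(content: string, term: string): Match[] {
  const needle = normalize(term);
  if (needle === "") return []; // an empty term would otherwise match every line

  const matches: Match[] = [];
  content.split("\n").forEach((text, i) => {
    if (normalize(text).includes(needle)) {
      matches.push({ line: i + 1, text });
    }
  });
  return matches;
}
```

Each match keeps the original line text, so the caller can print it in the `Line N: ...` form used in the sample output.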
Direct URL Mode
$ llmresearcher -u https://docs.python.org/3/tutorial/
Content:
──────────────────────────────────────────────────
**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z
──────────────────────────────────────────────────
# The Python Tutorial
Python is an easy to learn, powerful programming language...
## An Informal Introduction to Python
In the following examples, input and output are distinguished...
Verbose Mode
$ llmresearcher -v "nodejs tutorial"
[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...
Testing
Running Tests
# Run tests in watch mode
pnpm test
# Run tests once (CI mode)
pnpm test:run
# Run tests with coverage
pnpm test -- --coverage
Test Coverage
The test suite includes:
- Unit Tests: Individual component testing
  - search.test.ts: DuckDuckGo search functionality, URL decoding, rate limiting
  - extractor.test.ts: Content extraction, Markdown conversion, resource management
  - config.test.ts: Configuration validation and environment handling
- Integration Tests: End-to-end workflow testing
  - integration.test.ts: Complete search-to-extraction workflows, error handling, cleanup
Test Features
- Fast: Powered by vitest for quick feedback
- Type-safe: Full TypeScript support in tests
- Isolated: Each test cleans up its resources
- Comprehensive: Covers search, extraction, configuration, and integration scenarios
Troubleshooting
Common Issues
"Browser not found" Error
pnpm install-browsers
Rate Limiting Issues
- The tool automatically handles rate limiting with 1-second delays
- If you encounter 429 errors, the tool will automatically retry with exponential backoff
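Retrying with exponential backoff generally looks like the sketch below. The base delay of 1000 ms and the limit of 3 retries mirror the rateLimitDelay and maxRetries defaults. The 30-second cap, the doubling factor, and the choice to retry on any error (not only 429 responses) are assumptions. `backoffDelay` and `withRetry` are hypothetical names.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs = 1000, maxDelayMs = 30000): number {
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

// Runs `operation`, retrying up to `maxRetries` times with growing delays.
// The `sleep` parameter is injectable so tests need not wait in real time.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // out of retries
      await sleep(backoffDelay(attempt, baseDelayMs));
    }
  }
  throw lastError;
}
```

With these defaults the waits between attempts are 1 s, 2 s, and 4 s before the last error is rethrown.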
Content Extraction Failures
- Some sites may block automated access
- The tool includes fallback extraction methods (main → body content)
- Use verbose mode (-v) to see detailed error information
Permission Denied (Unix/Linux)
chmod +x dist/bin/llmresearcher.js
Performance Optimization
The tool is optimized for speed:
- Resource Blocking: Automatically blocks images, CSS, fonts
- Network Idle: Waits for JavaScript to complete rendering
- Content Caching: Supports local caching to avoid repeated requests
- Minimal Dependencies: Uses lightweight, focused libraries
Development
Project Structure
light-research-mcp/
├── dist/                              # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js           # CLI entry point (executable)
│   └── *.js                           # Compiled TypeScript modules
├── src/                               # TypeScript source files
│   ├── bin.ts                         # CLI entry point
│   ├── index.ts                       # Main LLMResearcher class
│   ├── mcp-server.ts                  # MCP server implementation
│   ├── search.ts                      # DuckDuckGo search implementation
│   ├── github-code-search.ts          # GitHub Code Search implementation
│   ├── extractor.ts                   # Content extraction with Playwright
│   ├── cli.ts                         # Interactive CLI interface
│   ├── config.ts                      # Configuration management
│   └── types.ts                       # TypeScript type definitions
├── test/                              # Test files (vitest)
│   ├── search.test.ts                 # Search functionality tests
│   ├── extractor.test.ts              # Content extraction tests
│   ├── config.test.ts                 # Configuration tests
│   ├── mcp-locale.test.ts             # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts  # MCP content extractor tests
│   └── integration.test.ts            # End-to-end integration tests
├── tsconfig.json                      # TypeScript configuration
├── tsup.config.ts                     # Build configuration
├── vitest.config.ts                   # Test configuration
├── package.json
└── README.md
Dependencies
Runtime Dependencies
- @modelcontextprotocol/sdk: Model Context Protocol server implementation
- @mozilla/readability: Content extraction from HTML
- cheerio: HTML parsing for search results
- commander: CLI argument parsing
- dompurify: HTML sanitization
- dotenv: Environment variable loading
- jsdom: DOM manipulation for server-side processing
- playwright: Browser automation for JS rendering
- turndown: HTML to Markdown conversion
Development Dependencies
- typescript: TypeScript compiler
- tsup: Fast TypeScript bundler
- vitest: Fast unit test framework
- @types/*: TypeScript type definitions
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Roadmap
Planned Features
- Enhanced MCP Tools: Additional specialized search tools for documentation, APIs, etc.
- Caching Layer: SQLite-based URL → Markdown caching with 24-hour TTL
- Search Engine Abstraction: Support for Brave Search, Bing, and other engines
- Content Summarization: Optional AI-powered content summarization
- Export Formats: JSON, plain text, and other output formats
- Batch Processing: Process multiple URLs from file input
- SSE Transport: Support for Server-Sent Events MCP transport
Performance Improvements
- Parallel Processing: Concurrent content extraction for multiple results
- Smart Caching: Intelligent cache invalidation based on content freshness
- Memory Optimization: Streaming content processing for large documents