zoharbabin/google-research-mcp
The Google Researcher MCP Server empowers AI assistants with web research capabilities through Google Search, content scraping, and Gemini AI analysis.
Google Researcher MCP Server
Empower AI assistants with robust, persistent, and secure web research capabilities.
This server implements the Model Context Protocol (MCP), providing a suite of tools for Google Search, content scraping, and Gemini AI analysis. It's designed for performance and reliability, featuring a persistent caching system, comprehensive timeout handling, and enterprise-grade security.
Latest Update (v1.2.1): Fixed a critical issue where the scrape_page and research_topic tools were returning placeholder test content instead of actual scraped data. All tools now return real web content as expected.
Table of Contents
- Why Use This Server?
- Features
- System Architecture
- YouTube Transcript Extraction
- Getting Started
- Usage
- Performance & Reliability
- Security
- Testing
- Troubleshooting
- Contributing
- License
Why Use This Server?
- Extend AI Capabilities: Grant AI assistants access to real-time web information and powerful analytical tools.
- Maximize Performance: Drastically reduce latency for repeated queries with a sophisticated two-layer persistent cache (in-memory and disk).
- Reduce Costs: Minimize expensive API calls to Google Search and Gemini by caching results.
- Ensure Reliability: Prevent failures and ensure consistent performance with comprehensive timeout handling and graceful degradation.
- Flexible & Secure Integration: Connect any MCP-compatible client via STDIO or HTTP+SSE, with enterprise-grade OAuth 2.1 for secure API access.
- Open & Extensible: MIT licensed, fully open-source, and designed for easy modification and extension.
Features
- Core Research Tools:
- google_search: Find information using the Google Search API.
- scrape_page: Extract content from websites and YouTube videos with robust transcript extraction.
- analyze_with_gemini: Process text using Google's powerful Gemini AI models.
- research_topic: A composite tool that combines search, scraping, and analysis into a single, efficient operation.
- YouTube Transcript Extraction:
- Robust YouTube transcript extraction with comprehensive error handling: 10 distinct error types with clear, actionable messages.
- Intelligent retry logic with exponential backoff: Automatic retries for transient failures (network issues, rate limiting, timeouts).
- User-friendly error messages and diagnostics: Clear feedback when transcript extraction fails, with specific reasons.
- Advanced Caching System:
- Two-Layer Cache: Combines a fast in-memory cache for immediate access with a persistent disk-based cache for durability.
- Custom Namespaces: Organizes cached data by tool, preventing collisions and simplifying management.
- Manual & Automated Persistence: Offers both automatic, time-based cache saving and manual persistence via a secure API endpoint.
- Robust Performance & Reliability:
- Comprehensive Timeouts: Protects against network issues and slow responses from external APIs.
- Graceful Degradation: Ensures the server remains responsive even if a tool or dependency fails.
- Dual Transport Protocols: Supports both STDIO for local process communication and HTTP+SSE for web-based clients.
- Enterprise-Grade Security:
- OAuth 2.1 Protection: Secures all HTTP endpoints with modern, industry-standard authorization.
- Granular Scopes: Provides fine-grained control over access to tools and administrative functions.
- Monitoring & Management:
- Administrative API: Exposes endpoints for monitoring cache statistics, managing the cache, and inspecting the event store.
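As a rough illustration of the two-layer cache with per-tool namespaces described in the list above, here is a minimal sketch. It is not the server's actual implementation: the class name `TwoLayerCache`, the one-file-per-entry disk layout, and the default directory are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";
import { promises as fs } from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical sketch: memory layer first, disk layer as a durable fallback.
class TwoLayerCache {
  private memory = new Map<string, CacheEntry<unknown>>();
  private dir: string;

  constructor(dir: string = path.join(os.tmpdir(), "mcp-cache-sketch")) {
    this.dir = dir;
  }

  // Prefixing keys with the tool name keeps entries from different tools apart.
  private key(namespace: string, id: string): string {
    return `${namespace}:${id}`;
  }

  // Hash the key so arbitrary queries and URLs map to safe file names.
  private file(key: string): string {
    const digest = createHash("sha256").update(key).digest("hex");
    return path.join(this.dir, `${digest}.json`);
  }

  async get<T>(namespace: string, id: string): Promise<T | undefined> {
    const key = this.key(namespace, id);
    let entry = this.memory.get(key) as CacheEntry<T> | undefined;
    if (!entry) {
      // Memory miss: try the disk layer and promote the entry on success.
      try {
        entry = JSON.parse(await fs.readFile(this.file(key), "utf8")) as CacheEntry<T>;
        this.memory.set(key, entry);
      } catch {
        return undefined; // not on disk either (or unreadable)
      }
    }
    if (entry.expiresAt <= Date.now()) {
      this.memory.delete(key);
      return undefined; // expired entries are treated as misses
    }
    return entry.value;
  }

  async set<T>(namespace: string, id: string, value: T, ttlMs: number): Promise<void> {
    const key = this.key(namespace, id);
    const entry: CacheEntry<T> = { value, expiresAt: Date.now() + ttlMs };
    this.memory.set(key, entry);
    await fs.mkdir(this.dir, { recursive: true });
    await fs.writeFile(this.file(key), JSON.stringify(entry), "utf8");
  }
}
```

This sketch writes through to disk on every `set` for simplicity. The server itself is described as persisting on a timer or on demand, which trades some durability for fewer disk writes.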
System Architecture
The server is built on a layered architecture designed for clarity, separation of concerns, and extensibility.
graph TD
subgraph "Client"
A[MCP Client]
end
subgraph "Transport Layer"
B[STDIO]
C[HTTP-SSE]
end
subgraph "Core Logic"
D{MCP Request Router}
E[Tool Executor]
end
subgraph "Tools"
F[google_search]
G[scrape_page]
H[analyze_with_gemini]
I[research_topic]
end
subgraph "Support Systems"
J[Persistent Cache]
K[Event Store]
L[OAuth Middleware]
end
A -- Connects via --> B
A -- Connects via --> C
B -- Forwards to --> D
C -- Forwards to --> D
D -- Routes to --> E
E -- Invokes --> F
E -- Invokes --> G
E -- Invokes --> H
E -- Invokes --> I
F & G & H & I -- Uses --> J
D -- Uses --> K
C -- Protected by --> L
style J fill:#f9f,stroke:#333,stroke-width:2px
style K fill:#ccf,stroke:#333,stroke-width:2px
style L fill:#f99,stroke:#333,stroke-width:2px
For a more detailed explanation, see the .
YouTube Transcript Extraction
The server includes a robust YouTube transcript extraction system that provides reliable access to video transcripts with comprehensive error handling and automatic recovery mechanisms.
Key Features
- Comprehensive Error Classification: Identifies 10 distinct error types with clear, actionable messages
- Intelligent Retry Logic: Exponential backoff mechanism for transient failures (max 3 attempts)
- Production Optimizations: End-to-end transcript extraction tests run 91% faster, with 80% less log output
- User-Friendly Feedback: Clear error messages explaining why transcript extraction failed
Supported Error Types
Error Code | Description | User Action |
---|---|---|
TRANSCRIPT_DISABLED | Video owner disabled transcripts | Try a different video |
VIDEO_UNAVAILABLE | Video no longer available | Verify the URL and video status |
VIDEO_NOT_FOUND | Invalid video ID or URL | Check the YouTube URL format |
NETWORK_ERROR | Network connectivity issues | System will retry automatically |
RATE_LIMITED | YouTube API rate limiting | System will retry with backoff |
TIMEOUT | Request timed out | System will retry automatically |
PARSING_ERROR | Transcript data parsing failed | Contact support if persistent |
REGION_BLOCKED | Video blocked in server region | Use proxy if needed |
PRIVATE_VIDEO | Video requires authentication | Use public videos only |
UNKNOWN | Unexpected error occurred | Contact support with details |
Retry Behavior
The system automatically retries failed requests for transient errors:
- Maximum Attempts: Up to 3 attempts for NETWORK_ERROR, RATE_LIMITED, and TIMEOUT
- Exponential Backoff: Progressive delays between retries to avoid overwhelming YouTube's API
- Smart Recovery: Only retries errors that are likely to succeed on subsequent attempts
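The retry behavior above can be sketched roughly as follows. This is an illustrative sketch, not the server's code: the helper names (`withRetry`, `backoffDelayMs`, `isTransient`), the 500 ms base delay, and the convention of carrying the error code in a `code` property are assumptions.

```typescript
// Error codes the document lists as retryable.
const TRANSIENT_CODES = new Set<string>(["NETWORK_ERROR", "RATE_LIMITED", "TIMEOUT"]);

// Assumption: failures carry their classification in a `code` property.
function isTransient(err: unknown): boolean {
  const code = (err as { code?: string } | null)?.code;
  return code !== undefined && TRANSIENT_CODES.has(code);
}

// Delay doubles with each attempt: base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs: number = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  baseMs: number = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent errors (e.g. TRANSCRIPT_DISABLED) are rethrown immediately;
      // transient ones are rethrown once the attempt budget is spent.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
    }
  }
}
```

A caller would wrap the transcript fetch, for example `await withRetry(() => fetchTranscript(videoId))`, where `fetchTranscript` stands in for whatever function performs the request.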
Example Error Messages
When transcript extraction fails, users receive clear, specific error messages:
Failed to retrieve YouTube transcript for https://www.youtube.com/watch?v=xxxx.
Reason: TRANSCRIPT_DISABLED - The video owner has disabled transcripts.
Failed to retrieve YouTube transcript for https://www.youtube.com/watch?v=xxxx after 3 attempts.
Reason: NETWORK_ERROR - A network error occurred.
For complete technical details, see the .
Getting Started
Prerequisites
- Node.js: Version 18.0.0 or higher.
- API Keys:
- OAuth 2.1 Provider (for HTTP transport): An external authorization server (e.g., Auth0, Okta) to issue JWTs.
Installation & Setup
- Clone the Repository:
git clone https://github.com/zoharbabin/google-research-mcp.git
cd google-research-mcp
- Install Dependencies:
npm install
- Configure Environment Variables: Create a .env file by copying the example and filling in your credentials.
cp .env.example .env
Now, open .env in your editor and add your API keys and OAuth configuration. See the comments in .env.example for detailed explanations of each variable.
Running the Server
- Development Mode: For development with automatic reloading on file changes, use:
npm run dev
This command uses tsx to watch for changes and restart the server.
- Production Mode: First, build the TypeScript project into JavaScript, then start the server:
npm run build
npm start
Upon successful startup, you will see confirmation that the transports are ready:
stdio transport ready
SSE server listening on http://127.0.0.1:3000/mcp
Usage
Available Tools
The server provides a suite of powerful tools for research and analysis. Each tool is designed with detailed descriptions and annotations to be easily understood and utilized by AI models.
Tool | Title | Description & Parameters |
---|---|---|
google_search | Google Web Search | Description: Searches the web using the Google Custom Search API to find relevant web pages and resources. Ideal for finding current information, discovering authoritative sources, and locating specific documents. Results are cached for 30 minutes. Parameters: - query (string, required): The search query. Use specific, targeted keywords for best results. - num_results (number, optional, default: 5): The number of search results to return (1-10). |
scrape_page | Web Page & YouTube Content Extractor | Description: Extracts text content from web pages and YouTube videos with robust transcript extraction capabilities. Features comprehensive error handling with 10 distinct error types (TRANSCRIPT_DISABLED, VIDEO_UNAVAILABLE, NETWORK_ERROR, etc.), automatic retry logic with exponential backoff for transient failures, and user-friendly error messages. Supports both youtube.com/watch?v= and youtu.be/ URL formats. Results are cached for 1 hour. Parameters: - url (string, required): The URL of the web page or YouTube video to scrape. YouTube URLs automatically extract transcripts when available. |
analyze_with_gemini | Gemini AI Text Analysis | Description: Processes and analyzes text content using Google's Gemini AI models. It can summarize, answer questions, and generate insights from provided text. Large texts are automatically truncated. Results are cached for 15 minutes. Parameters: - text (string, required): The text content to analyze. - model (string, optional, default: "gemini-2.0-flash-001"): The Gemini model to use (e.g., gemini-2.0-flash-001, gemini-pro). |
research_topic | Comprehensive Topic Research Workflow | Description: A powerful composite tool that automates the entire research process: it searches for a topic, scrapes the content from multiple sources, and synthesizes the findings with Gemini AI. It's designed for resilience and provides comprehensive analysis. Parameters: - query (string, required): The research topic or question. - num_results (number, optional, default: 3): The number of sources to research (recommended: 2-5). |
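For callers who want compile-time checking of tool arguments, the parameter lists in the table can be written down as TypeScript types. The sketch below is derived only from the table. The interface names and the `normalizeGoogleSearchArgs` helper are hypothetical, and treating `num_results` as an integer is an assumption (the table only says "number").

```typescript
// Argument shapes transcribed from the tool table.
interface GoogleSearchArgs {
  query: string;
  num_results?: number; // 1-10, default 5
}

interface ScrapePageArgs {
  url: string;
}

interface AnalyzeWithGeminiArgs {
  text: string;
  model?: string; // default "gemini-2.0-flash-001"
}

interface ResearchTopicArgs {
  query: string;
  num_results?: number; // default 3, recommended 2-5
}

// Hypothetical client-side check that applies the documented default and range
// before the request is sent.
function normalizeGoogleSearchArgs(args: GoogleSearchArgs): Required<GoogleSearchArgs> {
  const query = args.query.trim();
  if (query.length === 0) {
    throw new Error("query is required");
  }
  const numResults = args.num_results ?? 5;
  if (!Number.isInteger(numResults) || numResults < 1 || numResults > 10) {
    throw new RangeError("num_results must be an integer from 1 to 10");
  }
  return { query, num_results: numResults };
}
```

The normalized object can be passed directly as the `arguments` field of a `callTool` request for `google_search`.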
Client Integration
STDIO Client (Local Process)
Ideal for local tools and CLI applications.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the built server as a child process and talk to it over STDIO.
const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/server.js"]
});
// Client info carries both a name and a version.
const client = new Client({ name: "test-client", version: "1.0.0" });
await client.connect(transport);

// Web search example
const result = await client.callTool({
  name: "google_search",
  arguments: { query: "Model Context Protocol" }
});
console.log(result.content[0].text);

// YouTube transcript extraction example
const youtubeResult = await client.callTool({
  name: "scrape_page",
  arguments: { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" }
});
console.log(youtubeResult.content[0].text);
HTTP+SSE Client (Web Application)
Suitable for web-based clients. Requires a valid OAuth 2.1 Bearer token.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// The client MUST obtain a valid OAuth 2.1 Bearer token from your
// configured external Authorization Server before making requests.
const transport = new StreamableHTTPClientTransport(
  new URL("http://localhost:3000/mcp"),
  {
    // Send the token as a standard Authorization header on every request.
    requestInit: { headers: { Authorization: "Bearer YOUR_ACCESS_TOKEN" } }
  }
);
// Client info carries both a name and a version.
const client = new Client({ name: "test-client", version: "1.0.0" });
await client.connect(transport);

const result = await client.callTool({
  name: "google_search",
  arguments: { query: "Model Context Protocol" }
});
console.log(result.content[0].text);
// YouTube transcript extraction with error handling
try {
  const youtubeResult = await client.callTool({
    name: "scrape_page",
    arguments: { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" }
  });
  const text = youtubeResult.content?.[0]?.text ?? "";
  // A tool failure may come back as a result flagged isError rather than a thrown error.
  if (youtubeResult.isError) throw new Error(text);
  console.log("Transcript:", text);
} catch (error) {
  // Read the message defensively so this handler cannot itself throw.
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes("TRANSCRIPT_DISABLED")) {
    console.log("Video owner has disabled transcripts");
  } else if (message.includes("VIDEO_NOT_FOUND")) {
    console.log("Video not found - check the URL");
  } else {
    console.log("Transcript extraction failed:", message);
  }
}
Management API
The server provides several administrative endpoints for monitoring and control. Access to these endpoints is protected by OAuth scopes.
Method | Endpoint | Description | Required Scope |
---|---|---|---|
GET | /mcp/cache-stats | View cache performance statistics. | mcp:admin:cache:read |
GET | /mcp/event-store-stats | View event store usage statistics. | mcp:admin:event-store:read |
POST | /mcp/cache-invalidate | Clear specific cache entries. | mcp:admin:cache:invalidate |
POST | /mcp/cache-persist | Force the cache to be saved to disk. | mcp:admin:cache:persist |
GET | /mcp/oauth-scopes | Get documentation for all OAuth scopes. | Public |
GET | /mcp/oauth-config | View the server's OAuth configuration. | mcp:admin:config:read |
GET | /mcp/oauth-token-info | View details of the provided token. | Requires authentication |
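As a hedged illustration of calling these endpoints, the sketch below builds a request from the table and sends it with the `fetch` built into Node.js 18. The helper names `buildAdminRequest` and `callAdmin` are hypothetical, the JSON response format is an assumption, and request bodies (which /mcp/cache-invalidate presumably needs) are omitted because the table does not specify them.

```typescript
// Endpoint-to-method map transcribed from the Management API table.
const ADMIN_ENDPOINTS = {
  "/mcp/cache-stats": "GET",
  "/mcp/event-store-stats": "GET",
  "/mcp/cache-invalidate": "POST",
  "/mcp/cache-persist": "POST",
  "/mcp/oauth-scopes": "GET",
  "/mcp/oauth-config": "GET",
  "/mcp/oauth-token-info": "GET",
} as const;

type AdminEndpoint = keyof typeof ADMIN_ENDPOINTS;

// Pure helper: resolves the URL and attaches the Bearer token when one is given.
function buildAdminRequest(baseUrl: string, endpoint: AdminEndpoint, accessToken?: string) {
  const headers: Record<string, string> = { Accept: "application/json" };
  if (accessToken) {
    headers.Authorization = `Bearer ${accessToken}`;
  }
  return {
    url: new URL(endpoint, baseUrl).toString(),
    init: { method: ADMIN_ENDPOINTS[endpoint], headers },
  };
}

// Sends the request; assumes (not confirmed by the source) a JSON response body.
async function callAdmin(
  baseUrl: string,
  endpoint: AdminEndpoint,
  accessToken?: string
): Promise<unknown> {
  const { url, init } = buildAdminRequest(baseUrl, endpoint, accessToken);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`${init.method} ${endpoint} failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

For example, `await callAdmin("http://127.0.0.1:3000", "/mcp/cache-stats", token)` would need a token carrying the mcp:admin:cache:read scope.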
Performance & Reliability
The server has been optimized for production use with significant performance improvements and reliability enhancements:
YouTube Transcript Extraction Performance
- 91% Performance Improvement: End-to-end tests for YouTube transcript extraction are now 91% faster
- 80% Log Reduction: Streamlined logging reduces noise while maintaining diagnostic capabilities
- Production Controls: Environment-based configuration allows fine-tuning of retry behavior and timeouts
System Reliability
- Intelligent Error Recovery: Automatic retry with exponential backoff for transient failures
- Graceful Degradation: The system continues operating even when individual components encounter issues
- Comprehensive Error Classification: 10 distinct error types provide precise feedback for troubleshooting
- Resource Optimization: Efficient memory and CPU usage patterns for high-volume operations
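One common way to combine timeouts with graceful degradation in a multi-source workflow such as research_topic is to bound each source with a timer and keep whatever succeeds. The sketch below shows that pattern under stated assumptions. `withTimeout` and `gatherSources` are hypothetical names, and nothing here is taken from the server's source.

```typescript
// Rejects if the promise does not settle within `ms` milliseconds.
// Note: this stops waiting but does not cancel the underlying operation.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

interface GatherResult {
  contents: string[];
  failures: { url: string; reason: string }[];
}

// Scrapes every URL in parallel; a slow or failing source is recorded
// and skipped instead of failing the whole operation.
async function gatherSources(
  urls: string[],
  scrape: (url: string) => Promise<string>,
  timeoutMs: number
): Promise<GatherResult> {
  const settled = await Promise.allSettled(
    urls.map((url) => withTimeout(scrape(url), timeoutMs, url))
  );
  const result: GatherResult = { contents: [], failures: [] };
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      result.contents.push(outcome.value);
    } else {
      const reason =
        outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason);
      result.failures.push({ url: urls[i], reason });
    }
  });
  return result;
}
```

With this shape, the analysis step can proceed on `contents` while `failures` is reported back to the caller for diagnostics.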
Monitoring & Diagnostics
- Enhanced Logging: Detailed but efficient logging for production debugging
- Performance Metrics: Built-in performance tracking for all major operations
- Error Analytics: Structured error reporting for operational insights
These optimizations ensure the server can handle production workloads efficiently while providing reliable service even under adverse conditions.
Security
OAuth 2.1 Authorization
The server implements OAuth 2.1 authorization for all HTTP-based communication, ensuring that only authenticated and authorized clients can access its capabilities.
- Protection: All endpoints under /mcp/ (except for public documentation endpoints) are protected.
- Token Validation: The server validates JWTs (JSON Web Tokens) against the configured JWKS (JSON Web Key Set) URI from your authorization server.
- Scope Enforcement: Each tool and administrative action is mapped to a specific OAuth scope, providing granular control over permissions.
For a complete guide on setting up OAuth, see the .
Available Scopes
Tool Execution Scopes
mcp:tool:google_search:execute
mcp:tool:scrape_page:execute
mcp:tool:analyze_with_gemini:execute
mcp:tool:research_topic:execute
Administrative Scopes
mcp:admin:cache:read
mcp:admin:cache:invalidate
mcp:admin:cache:persist
mcp:admin:event-store:read
mcp:admin:config:read
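The scope names above follow a regular pattern, so an authorization check can be sketched in a few lines. This is an illustrative sketch, not the server's middleware: the function names are hypothetical, and it assumes the validated token exposes its scopes either as a space-delimited string (the usual OAuth `scope` claim) or as an array, which depends on your authorization server.

```typescript
// Tool scopes follow the pattern mcp:tool:<tool name>:execute.
function toolScope(tool: string): string {
  return `mcp:tool:${tool}:execute`;
}

// Accepts a space-delimited scope string or an array of scopes.
function parseScopes(scopeClaim: string | string[] | undefined): Set<string> {
  if (!scopeClaim) {
    return new Set();
  }
  const list = Array.isArray(scopeClaim) ? scopeClaim : scopeClaim.split(/\s+/);
  return new Set(list.filter((scope) => scope.length > 0));
}

// True only when the exact required scope was granted.
function isAuthorized(scopeClaim: string | string[] | undefined, required: string): boolean {
  return parseScopes(scopeClaim).has(required);
}
```

For example, a token granted `mcp:tool:google_search:execute mcp:admin:cache:read` passes `isAuthorized(claim, toolScope("google_search"))` but fails the same check for `scrape_page`.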
Testing
The project maintains a high standard of quality through a combination of end-to-end and focused component tests.
Script | Description |
---|---|
npm test | Runs all focused component tests (*.spec.ts ) using Jest. |
npm run test:e2e | Executes the full end-to-end test suite for both STDIO and SSE transports. |
npm run test:coverage | Generates a detailed code coverage report. |
For more details on the testing philosophy and structure, see the .
Troubleshooting
For diagnostics, use the administrative endpoints listed under Management API above, for example /mcp/cache-stats, /mcp/event-store-stats, and /mcp/oauth-token-info.
Contributing
We welcome contributions of all kinds! This project is open-source under the MIT license and we believe in the power of community collaboration.
- Star this repo if you find it useful.
- Fork it to create your own version.
- Report issues if you find bugs or have suggestions for improvements.
- Submit PRs for bug fixes, new features, or documentation enhancements.
To contribute code, please follow our .
License
This project is licensed under the MIT License. See the file for details.